Stage 2: Data Cleaning

Raw data is a mess. This stage transforms 5.5 million raw records into organized, queryable tables.

Why Clean Data Matters

Imagine analyzing reviews from one giant spreadsheet with 100+ columns. Every query would:

Read data you don't need
Repeat the same user info for every review they wrote
Make updates painful

Instead, we use a star schema — one central table (reviews) with satellite tables for related information.

The Six Core Tables

Click any table to explore its columns:

Reviews

4.4M rows

User Profiles

4.3M rows

Engagement

4.4M rows

Photos

1.1M rows

Users

1.4M rows

Metadata

4.4M rows

Click any table to see its columns →

Table Details

1. Reviews Table (The Hub)

This is the heart of everything. Each row is one review.

Size: 4,391,587 reviews (after deduplication)

Column	Description
`review_id`	Unique identifier (the "social security number" of reviews)
`product_id`	Which product this reviews
`author_id`	Who wrote it
`rating`	1-5 stars
`review_text`	The actual review
`title`	The headline
`is_recommended`	Would they recommend?
`submitted_at`	When they posted

🔍Key Insight

98% of reviews have actual text, not just star ratings. This is valuable because text reveals the why behind the rating.

2. User Profiles Table

Demographics for each review — who is this person?

Size: 4,312,200 rows

Column	Description	Coverage
`skin_type`	Oily, Dry, Combination, Normal	82.6%
`skin_tone`	Fair, Light, Medium, Tan, Deep	74.8%
`eye_color`	For makeup relevance	~70%
`hair_color`	For hair product relevance	~65%
`age_range`	18-24, 25-34, etc.	15%
`is_incentivized`	Received free product	100%
`is_staff`	Sephora employee	100%

⚠️Data Gaps

Not everyone fills in their profile. Age is especially sparse (only 15%). Our algorithms account for missing data.

3. Engagement Table

How did the community react to each review?

Size: 4,391,587 rows

Column	Description
`helpful_votes`	How many found it useful
`unhelpful_votes`	How many disagreed
`helpfulness_score`	Ratio: helpful / (helpful + unhelpful)

Why it matters: A review with 500 helpful votes is more trustworthy than one with 0. The crowd validates quality.

4. Photos Table

Visual proof of product experience.

Size: 1,131,575 photos

Column	Description
`review_id`	Links to the review
`photo_url`	Where the image is hosted

Key stat: About 20% of reviews include photos. These tend to be more credible — you can't easily fake a before/after photo.

5. Users Table

Aggregate stats per reviewer (across all their reviews).

Size: 1,445,498 unique users

Column	Description
`author_id`	The reviewer
`total_reviews`	How many reviews they've written
`avg_rating`	Their typical rating
`organic_reviews`	Reviews without incentive
`first_review_date`	When they started reviewing

Why it matters: A reviewer with 50 reviews and a 3.8 average is more trustworthy than someone who only reviews once and gives 5 stars.

6. Metadata Table

Technical details about each review's origin.

Size: 4,391,587 rows

Column	Description
`created_at`	When the record was created
`updated_at`	Last modification
`source_client`	Mobile app? Desktop? In-store?
`campaign_id`	If part of a marketing campaign

How Tables Connect

Think of it like a family tree with Reviews at the center:

                    User Profiles
                         │
                    (review_id)
                         │
Users ◄──(author_id)── REVIEWS ──(product_id)──► Products
                         │
                    (review_id)
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
         Engagement              Photos

The key connectors:

review_id — Every satellite table links back to reviews
product_id — Groups all reviews for one product
author_id — Links to user-level statistics

The Deduplication Problem

Raw data had 19.6% duplicates. Why?

Reviews updated over time (same ID, new content)
Products re-scraped during recovery runs
API pagination edge cases

Our solution: Keep the most recent version of each review_id. This preserved 4.4M unique reviews from 5.5M raw records.

Storage Format: Parquet

We save tables as Parquet files — a columnar format that's:

Compressed: 6x smaller than CSV
Fast: Only read columns you need
Typed: Knows the difference between strings and numbers

Combined with DuckDB (an embedded SQL database), we can query 4.4M reviews in seconds without loading everything into memory.

What's Next?

Clean data is ready. Now we add intelligence.

Next: Intelligence Layer → — Three AI models that score every review.

1. Data Collection 3. Intelligence Layer