2. Data Cleaning

Stage 2: Data Cleaning

Raw data is a mess. This stage transforms 5.5 million raw records into organized, queryable tables.

Why Clean Data Matters

Imagine analyzing reviews from one giant spreadsheet with 100+ columns. Every query would:

  • Read data you don't need
  • Repeat the same user info for every review they wrote
  • Make updates painful

Instead, we use a star schema — one central table (reviews) with satellite tables for related information.

The Six Core Tables

Click any table to explore its columns:

Reviews

4.4M rows

User Profiles

4.3M rows

Engagement

4.4M rows

Photos

1.1M rows

Users

1.4M rows

Metadata

4.4M rows

Click any table to see its columns →


Table Details

1. Reviews Table (The Hub)

This is the heart of everything. Each row is one review.

Size: 4,391,587 reviews (after deduplication)

ColumnDescription
review_idUnique identifier (the "social security number" of reviews)
product_idWhich product this reviews
author_idWho wrote it
rating1-5 stars
review_textThe actual review
titleThe headline
is_recommendedWould they recommend?
submitted_atWhen they posted
🔍Key Insight

98% of reviews have actual text, not just star ratings. This is valuable because text reveals the why behind the rating.


2. User Profiles Table

Demographics for each review — who is this person?

Size: 4,312,200 rows

ColumnDescriptionCoverage
skin_typeOily, Dry, Combination, Normal82.6%
skin_toneFair, Light, Medium, Tan, Deep74.8%
eye_colorFor makeup relevance~70%
hair_colorFor hair product relevance~65%
age_range18-24, 25-34, etc.15%
is_incentivizedReceived free product100%
is_staffSephora employee100%
⚠️Data Gaps

Not everyone fills in their profile. Age is especially sparse (only 15%). Our algorithms account for missing data.


3. Engagement Table

How did the community react to each review?

Size: 4,391,587 rows

ColumnDescription
helpful_votesHow many found it useful
unhelpful_votesHow many disagreed
helpfulness_scoreRatio: helpful / (helpful + unhelpful)

Why it matters: A review with 500 helpful votes is more trustworthy than one with 0. The crowd validates quality.


4. Photos Table

Visual proof of product experience.

Size: 1,131,575 photos

ColumnDescription
review_idLinks to the review
photo_urlWhere the image is hosted

Key stat: About 20% of reviews include photos. These tend to be more credible — you can't easily fake a before/after photo.


5. Users Table

Aggregate stats per reviewer (across all their reviews).

Size: 1,445,498 unique users

ColumnDescription
author_idThe reviewer
total_reviewsHow many reviews they've written
avg_ratingTheir typical rating
organic_reviewsReviews without incentive
first_review_dateWhen they started reviewing

Why it matters: A reviewer with 50 reviews and a 3.8 average is more trustworthy than someone who only reviews once and gives 5 stars.


6. Metadata Table

Technical details about each review's origin.

Size: 4,391,587 rows

ColumnDescription
created_atWhen the record was created
updated_atLast modification
source_clientMobile app? Desktop? In-store?
campaign_idIf part of a marketing campaign

How Tables Connect

Think of it like a family tree with Reviews at the center:

                    User Profiles

                    (review_id)

Users ◄──(author_id)── REVIEWS ──(product_id)──► Products

                    (review_id)

              ┌──────────┴──────────┐
              ▼                     ▼
         Engagement              Photos

The key connectors:

  • review_id — Every satellite table links back to reviews
  • product_id — Groups all reviews for one product
  • author_id — Links to user-level statistics

The Deduplication Problem

Raw data had 19.6% duplicates. Why?

  1. Reviews updated over time (same ID, new content)
  2. Products re-scraped during recovery runs
  3. API pagination edge cases

Our solution: Keep the most recent version of each review_id. This preserved 4.4M unique reviews from 5.5M raw records.


Storage Format: Parquet

We save tables as Parquet files — a columnar format that's:

  • Compressed: 6x smaller than CSV
  • Fast: Only read columns you need
  • Typed: Knows the difference between strings and numbers

Combined with DuckDB (an embedded SQL database), we can query 4.4M reviews in seconds without loading everything into memory.


What's Next?

Clean data is ready. Now we add intelligence.

Next: Intelligence Layer → — Three AI models that score every review.