Stage 2: Data Cleaning
Raw data is a mess. This stage transforms 5.5 million raw records into organized, queryable tables.
Why Clean Data Matters
Imagine analyzing reviews from one giant spreadsheet with 100+ columns. Every query would:
- Read data you don't need
- Repeat the same user info for every review they wrote
- Make updates painful
Instead, we use a star schema — one central table (reviews) with satellite tables for related information.
The Six Core Tables
The six tables at a glance:

| Table | Rows |
|---|---|
| Reviews | 4.4M |
| User Profiles | 4.3M |
| Engagement | 4.4M |
| Photos | 1.1M |
| Users | 1.4M |
| Metadata | 4.4M |
Table Details
1. Reviews Table (The Hub)
This is the heart of everything. Each row is one review.
Size: 4,391,587 reviews (after deduplication)
| Column | Description |
|---|---|
| review_id | Unique identifier (the "social security number" of reviews) |
| product_id | Which product this review is about |
| author_id | Who wrote it |
| rating | 1-5 stars |
| review_text | The actual review |
| title | The headline |
| is_recommended | Would they recommend it? |
| submitted_at | When they posted |
98% of reviews have actual text, not just star ratings. This is valuable because text reveals the why behind the rating.
2. User Profiles Table
Demographics for each review — who is this person?
Size: 4,312,200 rows
| Column | Description | Coverage |
|---|---|---|
| skin_type | Oily, Dry, Combination, Normal | 82.6% |
| skin_tone | Fair, Light, Medium, Tan, Deep | 74.8% |
| eye_color | For makeup relevance | ~70% |
| hair_color | For hair product relevance | ~65% |
| age_range | 18-24, 25-34, etc. | 15% |
| is_incentivized | Received free product | 100% |
| is_staff | Sephora employee | 100% |
Not everyone fills in their profile. Age is especially sparse (only 15%). Our algorithms account for missing data.
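As a rough sketch of how coverage figures like those in the table can be computed, assuming the profiles are loaded into pandas (the sample data below is invented; column names mirror the table above):

```python
import pandas as pd

# Invented sample standing in for the real user_profiles table.
profiles = pd.DataFrame({
    "skin_type": ["Oily", "Dry", None, "Combination"],
    "age_range": [None, "25-34", None, None],
})

# Coverage = share of rows where the field is filled in.
coverage = profiles.notna().mean()
# skin_type → 0.75, age_range → 0.25 for this sample
```

Any algorithm consuming these columns has to decide what a missing value means; here they are simply left as nulls rather than imputed.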
3. Engagement Table
How did the community react to each review?
Size: 4,391,587 rows
| Column | Description |
|---|---|
| helpful_votes | How many found it useful |
| unhelpful_votes | How many disagreed |
| helpfulness_score | Ratio: helpful / (helpful + unhelpful) |
Why it matters: A review with 500 helpful votes is more trustworthy than one with 0. The crowd validates quality.
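The ratio above can be sketched as a small function. The guard for zero votes is an assumption about how unrated reviews should be treated, not a documented rule of the pipeline:

```python
def helpfulness_score(helpful, unhelpful):
    """Ratio of helpful votes to total votes; None when nobody has voted."""
    total = helpful + unhelpful
    if total == 0:
        return None  # assumption: a review with no votes is unrated, not 0% helpful
    return helpful / total

score = helpfulness_score(500, 25)   # 500 / 525
unrated = helpfulness_score(0, 0)    # None
```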
4. Photos Table
Visual proof of product experience.
Size: 1,131,575 photos
| Column | Description |
|---|---|
| review_id | Links to the review |
| photo_url | Where the image is hosted |
Key stat: About 20% of reviews include photos. These tend to be more credible — you can't easily fake a before/after photo.
5. Users Table
Aggregate stats per reviewer (across all their reviews).
Size: 1,445,498 unique users
| Column | Description |
|---|---|
| author_id | The reviewer |
| total_reviews | How many reviews they've written |
| avg_rating | Their typical rating |
| organic_reviews | Reviews without incentive |
| first_review_date | When they started reviewing |
Why it matters: A reviewer with 50 reviews and a 3.8 average is more trustworthy than someone who only reviews once and gives 5 stars.
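A minimal sketch of how these per-reviewer aggregates could be derived from the reviews table with pandas (tiny invented sample; column names follow the tables above, and the real pipeline may differ):

```python
import pandas as pd

# Invented sample of the reviews table.
reviews = pd.DataFrame({
    "author_id": ["a1", "a1", "a2"],
    "rating": [4, 3, 5],
    "is_incentivized": [False, True, False],
    "submitted_at": pd.to_datetime(["2021-01-05", "2021-06-01", "2022-03-10"]),
})

# One row per reviewer, aggregated across all their reviews.
users = reviews.groupby("author_id").agg(
    total_reviews=("rating", "size"),
    avg_rating=("rating", "mean"),
    organic_reviews=("is_incentivized", lambda s: int((~s).sum())),
    first_review_date=("submitted_at", "min"),
).reset_index()
```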
6. Metadata Table
Technical details about each review's origin.
Size: 4,391,587 rows
| Column | Description |
|---|---|
| created_at | When the record was created |
| updated_at | Last modification |
| source_client | Mobile app? Desktop? In-store? |
| campaign_id | If part of a marketing campaign |
How Tables Connect
Think of it like a family tree with Reviews at the center:

                       User Profiles
                             │
                        (review_id)
                             │
    Users ◄──(author_id)── REVIEWS ──(product_id)──► Products
                             │
                        (review_id)
                             │
                  ┌──────────┴──────────┐
                  ▼                     ▼
             Engagement              Photos

The key connectors:
- review_id — Every satellite table links back to reviews
- product_id — Groups all reviews for one product
- author_id — Links to user-level statistics
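A sketch of how these connectors work in practice, using pandas merges over tiny invented stand-ins for the real tables (the actual pipeline queries Parquet files, as described below):

```python
import pandas as pd

# Invented miniature versions of three tables.
reviews = pd.DataFrame({"review_id": [1, 2], "author_id": ["a1", "a2"], "rating": [5, 3]})
engagement = pd.DataFrame({"review_id": [1, 2], "helpful_votes": [500, 0]})
users = pd.DataFrame({"author_id": ["a1", "a2"], "total_reviews": [50, 1]})

# Satellite tables join back to the hub on review_id; user stats join on author_id.
enriched = (
    reviews
    .merge(engagement, on="review_id", how="left")
    .merge(users, on="author_id", how="left")
)
```

Left joins keep every review even when a satellite row is missing, which matters given the partial coverage noted above.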
The Deduplication Problem
Raw data had 19.6% duplicates. Why?
- Reviews updated over time (same ID, new content)
- Products re-scraped during recovery runs
- API pagination edge cases
Our solution: Keep the most recent version of each review_id. This preserved 4.4M unique reviews from 5.5M raw records.
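A minimal sketch of that keep-the-most-recent rule in pandas (invented sample; the real pipeline may implement it differently, e.g. in SQL):

```python
import pandas as pd

# Invented raw records: review r1 appears twice because it was edited.
raw = pd.DataFrame({
    "review_id": ["r1", "r1", "r2"],
    "review_text": ["old text", "edited text", "only version"],
    "updated_at": pd.to_datetime(["2023-01-01", "2023-05-01", "2023-02-01"]),
})

# Sort oldest-to-newest, then keep the last (most recent) row per review_id.
deduped = (
    raw.sort_values("updated_at")
       .drop_duplicates("review_id", keep="last")
)
```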
Storage Format: Parquet
We save tables as Parquet files — a columnar format that's:
- Compressed: 6x smaller than CSV
- Fast: Only read columns you need
- Typed: Knows the difference between strings and numbers
Combined with DuckDB (an embedded SQL database), we can query 4.4M reviews in seconds without loading everything into memory.
What's Next?
Clean data is ready. Now we add intelligence.
Next: Intelligence Layer → three AI models that score every review.