The Big Picture

Think of this pipeline like a factory with four stations. Raw materials (reviews) come in one end, and actionable intelligence comes out the other.

The Four Stages

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ COLLECTION  │ ──► │  CLEANING   │ ──► │INTELLIGENCE │ ──► │  RANKING    │
│             │     │             │     │             │     │             │
│ Scrape from │     │ Organize    │     │ AI models   │     │ Love Score  │
│ Sephora API │     │ into tables │     │ add signals │     │ formula     │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Why This Order?

Each stage depends on the previous one:

You can't analyze messy data — Cleaning must come before intelligence
You can't rank without signals — ML scores must exist before Love Score
Garbage in, garbage out — Quality at each stage compounds

Stage 1: Data Collection

What happens: We connect to the BazaarVoice API (the service behind Sephora's reviews) and download every review for every product.

The challenge: Rate limiting, 403 errors, and pagination. We built retry logic and checkpointing so that a crash does not lose progress.

The result: 5.5 million raw review records (before deduplication)
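
To make the retry and checkpointing idea concrete, here is a minimal sketch. It is not the project's actual scraper: `fetch_page`, `RetryableError`, the offset-based pagination, the checkpoint file layout, and the exponential backoff schedule are all assumptions chosen for illustration.

```python
import json
import os
import time


class RetryableError(Exception):
    """Hypothetical error standing in for a 403 or rate-limit response."""


def fetch_all_pages(fetch_page, out_path, checkpoint_path,
                    max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Append every review to out_path (JSONL), resuming from a checkpoint.

    fetch_page(offset) is a caller-supplied function (an assumption here)
    that returns a list of review dicts, or an empty list when done.
    Returns the number of reviews written during this run.
    """
    # Resume from the last saved offset if a checkpoint file exists.
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]

    written = 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset)
                break
            except RetryableError:
                # Assumed backoff schedule: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
        else:
            raise RuntimeError(f"giving up at offset {offset}")

        if not page:
            return written

        # Write the page first, then advance the checkpoint, so a crash
        # between the two steps never skips reviews.
        with open(out_path, "a") as f:
            for review in page:
                f.write(json.dumps(review) + "\n")
        offset += len(page)
        written += len(page)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset}, f)
```

Writing the page before updating the checkpoint means a crash between the two steps can re-download one page on restart. That is one reason a later deduplication step is useful.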

Stage 2: Data Cleaning

What happens: Raw JSONL files are transformed into organized, queryable Parquet tables. We remove duplicates (19.6% of raw records), normalize fields, and split the data into logical groups.

Why split into tables?

Faster queries (don't read data you don't need)
Less redundancy (user info stored once, not repeated)
Easier updates (change one place, not everywhere)

The result: 6 clean tables with 4.4 million unique reviews
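
The numbers are consistent: 5.5M × (1 − 0.196) ≈ 4.42M, which rounds to the 4.4 million unique reviews above. The sketch below shows the deduplicate-then-split idea on in-memory records. It is an illustration only: the field names (`review_id`, `user_id`, `skin_type`, and so on) are hypothetical, and it builds just two tables instead of the real six.

```python
def clean(raw_records):
    """Drop duplicate reviews, then split into a reviews and a users table."""
    seen_ids = set()
    reviews = []
    users = {}
    for rec in raw_records:
        # Deduplicate on review_id (assumed to be the unique key).
        if rec["review_id"] in seen_ids:
            continue
        seen_ids.add(rec["review_id"])

        # User info is stored once per user, not repeated on every review.
        users.setdefault(rec["user_id"], {
            "user_id": rec["user_id"],
            "skin_type": rec["skin_type"],
        })

        # The reviews table keeps only a reference to the user.
        reviews.append({
            "review_id": rec["review_id"],
            "user_id": rec["user_id"],
            "product_id": rec["product_id"],
            "rating": rec["rating"],
            "text": rec["text"].strip(),  # simple normalization
        })
    return reviews, list(users.values())


raw = [
    {"review_id": "r1", "user_id": "u1", "skin_type": "dry",
     "product_id": "p1", "rating": 5, "text": " Love it "},
    {"review_id": "r1", "user_id": "u1", "skin_type": "dry",
     "product_id": "p1", "rating": 5, "text": " Love it "},  # duplicate
    {"review_id": "r2", "user_id": "u1", "skin_type": "dry",
     "product_id": "p2", "rating": 2, "text": "Too oily"},
]
reviews, users = clean(raw)
```

Here `reviews` ends up with two rows and `users` with one, so the user's details live in a single place.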

Stage 3: Intelligence Layer

What happens: Three AI models analyze every review and add quality signals.

Model	Purpose	Output
Quality Scorer	Separate thoughtful reviews from junk	Score 0-1 per review
Fake Detector	Identify suspicious/incentivized reviews	Probability 0-1
Sentiment Analyzer	Measure emotional tone beyond stars	Score 0-1

🔍 Why Three Models?

Each model catches different problems. A fake review might have good quality (well-written spam). A genuine review might have low quality (one word). Combining signals gives the full picture.
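
The two cases above can be sketched as follows. This is only an illustration of why the signals are kept separate: the `ReviewSignals` container, the `triage` function, its labels, and the 0.5 thresholds are hypothetical and are not taken from the pipeline.

```python
from dataclasses import dataclass


@dataclass
class ReviewSignals:
    quality: float           # 0-1, from the Quality Scorer
    fake_probability: float  # 0-1, from the Fake Detector
    sentiment: float         # 0-1, from the Sentiment Analyzer


def triage(s, fake_threshold=0.5, quality_threshold=0.5):
    """Label a review using two signals; thresholds are assumed values."""
    if s.fake_probability >= fake_threshold:
        return "suspicious"
    if s.quality < quality_threshold:
        return "genuine but low-effort"
    return "trustworthy"


# Well-written spam: high quality, but very likely fake.
spam = ReviewSignals(quality=0.9, fake_probability=0.95, sentiment=0.9)
# One-word genuine review: unlikely to be fake, but low quality.
one_word = ReviewSignals(quality=0.1, fake_probability=0.05, sentiment=0.8)

labels = (triage(spam), triage(one_word))
```

A quality score alone would rate the spam review highly, and a fake score alone would treat the one-word review as fully useful. Looking at both separates the two cases.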

Stage 4: Product Ranking

What happens: All signals are combined using the Love Score formula — a weighted algorithm that balances:

What organic (unpaid) reviewers think
Community engagement (helpful votes, photos)
Review authenticity percentage
Demographic diversity
Recent momentum

Plus adjustments for red flags (rating inflation, staff reviews, polarization) and green flags (power user endorsement).

The result: Every product gets a Love Score from 0 to 1, with full transparency about why.
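
As a rough sketch of what a weighted score with flag adjustments can look like, consider the code below. The real Love Score formula is covered in the Product Ranking section. Every number here (the weights, the per-flag penalty and bonus) and every name is a placeholder invented for illustration.

```python
# Placeholder weights for the five components; they sum to 1.
WEIGHTS = {
    "organic_sentiment": 0.35,  # what organic (unpaid) reviewers think
    "engagement": 0.20,         # helpful votes, photos
    "authenticity": 0.20,       # share of reviews judged authentic
    "diversity": 0.10,          # demographic diversity
    "momentum": 0.15,           # recent momentum
}
RED_FLAG_PENALTY = 0.05   # assumed penalty per red flag
GREEN_FLAG_BONUS = 0.05   # assumed bonus per green flag


def love_score(components, red_flags=0, green_flags=0):
    """Return (score, contributions); each component must be in [0, 1]."""
    # Per-component contributions double as the explanation of the score.
    contributions = {
        name: WEIGHTS[name] * components[name] for name in WEIGHTS
    }
    adjusted = (
        sum(contributions.values())
        - RED_FLAG_PENALTY * red_flags
        + GREEN_FLAG_BONUS * green_flags
    )
    # Clamp so the final score stays in the 0 to 1 range.
    return min(1.0, max(0.0, adjusted)), contributions


example = {
    "organic_sentiment": 0.8,
    "engagement": 0.5,
    "authenticity": 0.9,
    "diversity": 0.4,
    "momentum": 0.6,
}
score, breakdown = love_score(example, red_flags=1)
```

With these placeholder numbers the weighted sum is 0.69, and one red flag brings the score to about 0.64. Returning the per-component breakdown alongside the score is one simple way to show why a product ranks where it does.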

The Data Flow

Sephora Website
      │
      ▼
BazaarVoice API ──────► Raw JSONL Files (5.5M records)
                              │
                              ▼
                        ETL Pipeline ──────► Clean Parquet Tables
                              │                    │
                              │         ┌──────────┼──────────┐
                              │         ▼          ▼          ▼
                              │     Quality    Fake      Sentiment
                              │     Model      Model     Model
                              │         │          │          │
                              │         └──────────┼──────────┘
                              │                    ▼
                              │              Score Tables
                              │                    │
                              ▼                    ▼
                        Product Finder ◄─────── All Signals
                              │
                              ▼
                     Ranked Products (Love Score)

What's Next?

Now that you understand the big picture, dive into each stage:

Data Collection: How we scraped 5.5M raw review records
Data Cleaning: The 6 tables and how they connect
Intelligence Layer: The three AI models
Product Ranking: The Love Score formula
