The Big Picture

Think of this pipeline like a factory with four stations. Raw materials (reviews) come in one end, and actionable intelligence comes out the other.

The Four Stages

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ COLLECTION  │ ──► │  CLEANING   │ ──► │INTELLIGENCE │ ──► │  RANKING    │
│             │     │             │     │             │     │             │
│ Scrape from │     │ Organize    │     │ AI models   │     │ Love Score  │
│ Sephora API │     │ into tables │     │ add signals │     │ formula     │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Why This Order?

Each stage depends on the previous one:

  1. You can't analyze messy data — Cleaning must come before intelligence
  2. You can't rank without signals — ML scores must exist before Love Score
  3. Garbage in, garbage out — Quality at each stage compounds

Stage 1: Data Collection

What happens: We connect to the BazaarVoice API (the service behind Sephora's reviews) and download every review for every product.

The challenge: rate limiting, 403 errors, and pagination. We built retry logic and checkpointing so that a crash does not lose progress.
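
A minimal sketch of how retrying and checkpointing can fit together is shown below. It is not the project's actual scraper: fetch_page, RetryableError, the offset-based pagination, and the checkpoint file layout are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path


class RetryableError(Exception):
    """Hypothetical error raised by fetch_page on rate limits or 403 responses."""


def fetch_with_retry(fetch_page, offset, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page(offset), backing off exponentially between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fetch_page(offset)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


def load_checkpoint(path):
    """Return the last saved offset, or 0 if no checkpoint exists yet."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())["offset"]
    return 0


def save_checkpoint(path, offset):
    """Persist the offset of the next page to request."""
    Path(path).write_text(json.dumps({"offset": offset}))


def collect(fetch_page, checkpoint_path, out_path, sleep=time.sleep):
    """Append reviews to a JSONL file, resuming from the checkpoint after a crash."""
    offset = load_checkpoint(checkpoint_path)
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            reviews = fetch_with_retry(fetch_page, offset, sleep=sleep)
            if not reviews:
                break  # an empty page means there is nothing left to download
            for review in reviews:
                out.write(json.dumps(review) + "\n")
            out.flush()
            offset += len(reviews)
            save_checkpoint(checkpoint_path, offset)
    return offset
```

The checkpoint is written only after a page has been flushed to disk, so a crash between the two steps can re-download one page but never skip one. That is one reason a deduplication pass is still useful downstream.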

The result: 5.5 million raw review records (before deduplication)


Stage 2: Data Cleaning

What happens: Raw JSON files are transformed into organized, queryable tables. We remove duplicates (19.6% of records), normalize fields, and split data into logical groups.

Why split into tables?

  • Faster queries (don't read data you don't need)
  • Less redundancy (user info stored once, not repeated)
  • Easier updates (change one place, not everywhere)

The result: 6 clean tables with 4.4 million unique reviews
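
The numbers are consistent with each other: removing 19.6% of 5.5 million records leaves roughly 4.4 million. The sketch below illustrates the three cleaning steps (normalize, deduplicate, split) on plain Python dictionaries. The field names (review_id, user_id, skin_type, and so on) and the two-table split are assumptions for illustration only; the real pipeline produces 6 tables whose schemas are not described here.

```python
def normalize(record):
    """Lowercase the keys and trim whitespace from string values."""
    return {
        key.lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def deduplicate(records, key="review_id"):
    """Keep the first record seen for each review ID (hypothetical key name)."""
    seen = set()
    unique = []
    for record in records:
        record = normalize(record)
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique.append(record)
    return unique


def split_tables(records):
    """Split flat records into a reviews table and a users table."""
    reviews, users = [], {}
    for record in records:
        # User attributes are stored once per user, not once per review.
        users[record["user_id"]] = {
            "user_id": record["user_id"],
            "skin_type": record.get("skin_type"),
        }
        reviews.append(
            {k: record.get(k) for k in ("review_id", "user_id", "product_id", "rating", "text")}
        )
    return reviews, list(users.values())
```

Storing user attributes in their own table is what gives the "stored once, not repeated" benefit listed above: a reviewer with many reviews appears in the users table only once.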


Stage 3: Intelligence Layer

What happens: Three AI models analyze every review and add quality signals.

Model               Purpose                                   Output
------------------  ----------------------------------------  --------------------
Quality Scorer      Separate thoughtful reviews from junk     Score 0-1 per review
Fake Detector       Identify suspicious/incentivized reviews  Probability 0-1
Sentiment Analyzer  Measure emotional tone beyond stars       Score 0-1

🔍 Why Three Models?

Each model catches different problems. A fake review might have good quality (well-written spam). A genuine review might have low quality (one word). Combining signals gives the full picture.
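
To make the point concrete, the sketch below treats the three model outputs as one record per review and labels the two cases described above. The ReviewSignals class, the describe function, and both 0.5 thresholds are hypothetical; they are not taken from the project's models.

```python
from dataclasses import dataclass


@dataclass
class ReviewSignals:
    quality: float    # 0-1, from the Quality Scorer
    fake_prob: float  # 0-1, from the Fake Detector
    sentiment: float  # 0-1, from the Sentiment Analyzer


def describe(signals, quality_min=0.5, fake_max=0.5):
    """Label a review using two signals together (thresholds are illustrative)."""
    authentic = signals.fake_prob < fake_max
    thoughtful = signals.quality >= quality_min
    if authentic and thoughtful:
        return "trusted"
    if authentic:
        return "genuine but thin"
    if thoughtful:
        return "well-written but suspicious"
    return "junk"


# Well-written spam: high quality, but the fake detector is suspicious.
spam = ReviewSignals(quality=0.9, fake_prob=0.9, sentiment=0.8)
# A genuine one-word review: low quality, but very likely authentic.
one_word = ReviewSignals(quality=0.1, fake_prob=0.05, sentiment=0.9)
```

With only the quality score, the spam review would look excellent; with only the fake probability, the one-word review would look as valuable as a detailed one. Neither signal alone separates the four cases.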


Stage 4: Product Ranking

What happens: All signals are combined using the Love Score formula — a weighted algorithm that balances:

  • What organic (unpaid) reviewers think
  • Community engagement (helpful votes, photos)
  • Review authenticity percentage
  • Demographic diversity
  • Recent momentum

Plus adjustments for red flags (rating inflation, staff reviews, polarization) and green flags (power user endorsement).

The result: Every product gets a Love Score from 0 to 1, with full transparency about why.
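
As a rough illustration of what a weighted formula with flag adjustments can look like, here is a sketch. The weights, the flat 0.05 penalty and bonus, and the component names are all invented for this example; the actual Love Score formula is covered in the Product Ranking section and may differ in every detail.

```python
# Hypothetical weights for the five components listed above (they sum to 1).
WEIGHTS = {
    "organic_rating": 0.35,
    "engagement": 0.20,
    "authenticity": 0.20,
    "diversity": 0.10,
    "momentum": 0.15,
}
RED_FLAG_PENALTY = 0.05   # assumed flat penalty per red flag
GREEN_FLAG_BONUS = 0.05   # assumed flat bonus per green flag


def love_score(components, red_flags=(), green_flags=()):
    """Return (score, breakdown) where each component is already scaled to 0-1."""
    contributions = {name: WEIGHTS[name] * components[name] for name in WEIGHTS}
    base = sum(contributions.values())
    adjustment = GREEN_FLAG_BONUS * len(green_flags) - RED_FLAG_PENALTY * len(red_flags)
    score = min(1.0, max(0.0, base + adjustment))  # clamp to the 0-1 range
    breakdown = {
        "contributions": contributions,
        "base": base,
        "adjustment": adjustment,
        "red_flags": list(red_flags),
        "green_flags": list(green_flags),
    }
    return score, breakdown
```

Returning the breakdown alongside the score is one way to provide the transparency mentioned above: a reader can see how much each component contributed and which flags moved the result.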


The Data Flow

Sephora Website
       │
       ▼
BazaarVoice API ──────► Raw JSONL Files (5.5M records)
                              │
                              ▼
                        ETL Pipeline ──────► Clean Parquet Tables
                              │                    │
                              │         ┌──────────┼──────────┐
                              │         ▼          ▼          ▼
                              │     Quality    Fake      Sentiment
                              │     Model      Model     Model
                              │         │          │          │
                              │         └──────────┼──────────┘
                              │                    ▼
                              │              Score Tables
                              │                    │
                              ▼                    ▼
                        Product Finder ◄─────── All Signals
                              │
                              ▼
                     Ranked Products (Love Score)

What's Next?

Now that you understand the big picture, dive into each stage:

  1. Data Collection: How we collected 5.5M raw review records (4.4M unique after deduplication)
  2. Data Cleaning: The 6 tables and how they connect
  3. Intelligence Layer: The three AI models
  4. Product Ranking: The Love Score formula