The Big Picture
Think of this pipeline like a factory with four stations. Raw materials (reviews) come in one end, and actionable intelligence comes out the other.
The Four Stages
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ COLLECTION  │ ──► │  CLEANING   │ ──► │INTELLIGENCE │ ──► │   RANKING   │
│             │     │             │     │             │     │             │
│ Scrape from │     │  Organize   │     │  AI models  │     │ Love Score  │
│ Sephora API │     │ into tables │     │ add signals │     │   formula   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
Why This Order?
Each stage depends on the previous one:
- You can't analyze messy data — Cleaning must come before intelligence
- You can't rank without signals — ML scores must exist before Love Score
- Garbage in, garbage out — Quality at each stage compounds
Stage 1: Data Collection
What happens: We connect to the BazaarVoice API (the service behind Sephora's reviews) and download every review for every product.
The challenge: rate limiting, 403 errors, and pagination. We built retry logic and checkpointing so that a crash does not lose the progress made so far.
The result: 5.5 million raw review records (before deduplication).
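The retry and checkpointing idea can be sketched as follows. This is a minimal illustration, not the project's actual scraper: the offset-based pagination, the exponential backoff, the `TransientError` class, and the injected `fetch_page` callable are all assumptions made for the example.

```python
import json
import os
import time


class TransientError(Exception):
    """Hypothetical error type for retryable failures (e.g. HTTP 403 or 429)."""


def load_offset(checkpoint_path):
    """Return the saved offset, or 0 if no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["offset"]


def save_offset(checkpoint_path, offset):
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, checkpoint_path)


def fetch_with_retry(fetch_page, offset, limit, max_retries=5,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch_page, waiting 1x, 2x, 4x... base_delay between failed attempts."""
    for attempt in range(max_retries):
        try:
            return fetch_page(offset, limit)
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


def collect(fetch_page, out_path, checkpoint_path, limit=100, sleep=time.sleep):
    """Page through all reviews, appending to a JSONL file and checkpointing each page."""
    offset = load_offset(checkpoint_path)  # resume where the last run stopped
    while True:
        page = fetch_with_retry(fetch_page, offset, limit, sleep=sleep)
        if not page:
            break  # an empty page is taken to mean the end of the data
        with open(out_path, "a") as f:
            for record in page:
                f.write(json.dumps(record) + "\n")
        offset += len(page)
        save_offset(checkpoint_path, offset)
    return offset
```

In this sketch, a crash between writing a page and saving the checkpoint would write that page twice on the next run, which is one way duplicates can end up in raw data and why a later deduplication pass is useful.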
Stage 2: Data Cleaning
What happens: Raw JSON files are transformed into organized, queryable tables. We remove duplicates (19.6% of records), normalize fields, and split data into logical groups.
Why split into tables?
- Faster queries (don't read data you don't need)
- Less redundancy (user info stored once, not repeated)
- Easier updates (change one place, not everywhere)
The result: 6 clean tables holding 4.4 million unique reviews.
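As a rough illustration of the cleaning step, the sketch below deduplicates records and splits them into two tables. The field names (`review_id`, `user_id`, `user_name`, `product_id`, `rating`) and the choice of only two tables are assumptions for the example; the real pipeline produces six tables whose schemas are not shown here.

```python
def deduplicate(records, key="review_id"):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def split_tables(records):
    """Split flat records into a reviews table and a users table.

    User details are stored once per user instead of once per review.
    """
    users = {}
    reviews = []
    for r in records:
        users[r["user_id"]] = {
            "user_id": r["user_id"],
            "user_name": r["user_name"].strip(),  # normalize stray whitespace
        }
        reviews.append({
            "review_id": r["review_id"],
            "user_id": r["user_id"],  # foreign key into the users table
            "product_id": r["product_id"],
            "rating": int(r["rating"]),  # normalize "5" and 5 to the same type
        })
    return reviews, list(users.values())


# Hypothetical raw input: review r1 was downloaded twice.
raw = [
    {"review_id": "r1", "user_id": "u1", "user_name": " Ana ", "product_id": "p1", "rating": "5"},
    {"review_id": "r2", "user_id": "u1", "user_name": "Ana", "product_id": "p2", "rating": 4},
    {"review_id": "r1", "user_id": "u1", "user_name": " Ana ", "product_id": "p1", "rating": "5"},
    {"review_id": "r3", "user_id": "u2", "user_name": "Bo", "product_id": "p1", "rating": 2},
]

unique = deduplicate(raw)
reviews, users = split_tables(unique)
print(len(raw), len(unique), len(reviews), len(users))
```

The headline numbers are consistent with each other: 5.5 million × (1 − 0.196) ≈ 4.42 million, which matches the 4.4 million unique reviews reported.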
Stage 3: Intelligence Layer
What happens: Three AI models analyze every review and add quality signals.
| Model | Purpose | Output |
|---|---|---|
| Quality Scorer | Separate thoughtful reviews from junk | Score 0-1 per review |
| Fake Detector | Identify suspicious/incentivized reviews | Probability 0-1 |
| Sentiment Analyzer | Measure emotional tone beyond stars | Score 0-1 |
Each model catches different problems. A fake review might have good quality (well-written spam). A genuine review might have low quality (one word). Combining signals gives the full picture.
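To make the point about combining signals concrete, here is a small sketch. The `ReviewSignals` class, the `describe` function, the 0.5 threshold, and the assumption that a higher sentiment score means a more positive tone are all illustrative and not taken from the actual models.

```python
from dataclasses import dataclass


@dataclass
class ReviewSignals:
    quality: float           # 0-1, higher means more thoughtful
    fake_probability: float  # 0-1, higher means more suspicious
    sentiment: float         # 0-1, assumed here: higher means more positive


def describe(signals, threshold=0.5):
    """Label a review using quality and authenticity together (hypothetical threshold)."""
    authentic = signals.fake_probability < threshold
    thoughtful = signals.quality >= threshold
    if authentic and thoughtful:
        return "trustworthy"
    if authentic:
        return "genuine but thin"
    if thoughtful:
        return "well-written but suspicious"
    return "junk"


# Well-written spam: high quality, but very likely fake.
print(describe(ReviewSignals(quality=0.9, fake_probability=0.95, sentiment=0.99)))
# One-word genuine review: authentic, but low quality.
print(describe(ReviewSignals(quality=0.1, fake_probability=0.05, sentiment=0.8)))
```

Neither score alone would separate these two reviews correctly, which is why the pipeline keeps the signals as independent columns.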
Stage 4: Product Ranking
What happens: All signals are combined using the Love Score formula — a weighted algorithm that balances:
- What organic (unpaid) reviewers think
- Community engagement (helpful votes, photos)
- Review authenticity percentage
- Demographic diversity
- Recent momentum
The formula also applies adjustments for red flags (rating inflation, staff reviews, polarization) and green flags (power user endorsement).
The result: Every product gets a Love Score from 0 to 1, with full transparency about why.
The Data Flow
Sephora Website
       │
       ▼
BazaarVoice API ──────► Raw JSONL Files (5.5M records)
       │
       ▼
 ETL Pipeline   ──────► Clean Parquet Tables
       │                          │
       │               ┌──────────┼──────────┐
       │               ▼          ▼          ▼
       │            Quality     Fake     Sentiment
       │             Model      Model      Model
       │               │          │          │
       │               └──────────┼──────────┘
       │                          ▼
       │                    Score Tables
       │                          │
       ▼                          ▼
Product Finder ◄──────────── All Signals
       │
       ▼
Ranked Products (Love Score)
What's Next?
Now that you understand the big picture, dive into each stage:
- Data Collection → How we scraped 5.5M raw review records (4.4M unique after cleaning)
- Data Cleaning → The 6 tables and how they connect
- Intelligence Layer → The three AI models
- Product Ranking → The Love Score formula