How to Run the Pipeline
Here's the correct order to execute everything. Each step consumes output from one or more earlier steps.
Execution Sequence
Step 1 ──► Step 2 ──► Step 3 ──► Step 4 ──┐
              │                           │
              ├──► Step 5 (parallel) ─────┤
              │                           │
              └──► Step 6 ──► Step 7 ─────┤
                                          ▼
                    Step 8 ──► Step 9 ──► Step 10

Step 1: Data Collection
What: Scrape reviews from Sephora's BazaarVoice API
Input: Sephora sitemap (product URLs)
Output: Raw JSONL files (~2 GB compressed)
Time: Several days (rate limited)
The API allows ~1 request every 2 seconds. Don't run this unless you need fresh data.
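To respect that ~1-request-per-2-seconds limit, the scraper needs some form of throttling. The repo's actual implementation isn't shown here; this is a minimal sketch of a rate limiter you could wrap around each API call (the 2-second default mirrors the note above, but the exact safe interval is an assumption):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_call = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        # Sleep just long enough to keep calls at least
        # `min_interval` seconds apart.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Usage would look like `limiter = RateLimiter(2.0)`, then `limiter.wait()` before each request.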
Step 2: Data Cleaning
What: Transform raw files into organized tables. Deduplicate and normalize.
Input: Raw JSONL files
Output: 6 Parquet tables:
- reviews.parquet
- user_profiles.parquet
- review_engagement.parquet
- review_photos.parquet
- users.parquet
- review_metadata.parquet
Time: ~30 minutes
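The deduplicate-and-normalize pass can be sketched as a small pandas transform. The column names (`review_id`, `text`) are illustrative guesses at the raw JSONL schema, not the pipeline's actual field names:

```python
import pandas as pd

def clean_reviews(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate and normalize raw review records."""
    df = raw.copy()
    # Normalize text: trim edges, collapse internal whitespace runs.
    df["text"] = df["text"].str.strip().str.replace(r"\s+", " ", regex=True)
    # Drop duplicate reviews by id, keeping the first occurrence.
    df = df.drop_duplicates(subset="review_id", keep="first")
    return df.reset_index(drop=True)
```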
Step 3: Train Review Quality Model
What: Learn what makes a review high-quality using incentivized flag as signal.
Input: Clean Parquet tables
Output: models/review_quality_classifier.pkl
Time: ~10 minutes
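In outline, training looks like the sketch below. The incentivized flag as the label comes from the description above; the TF-IDF + logistic regression architecture is a stand-in, since the real model behind `review_quality_classifier.pkl` isn't specified here:

```python
import os
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_quality_model(texts, labels):
    """Fit a text classifier; `labels` is the incentivized flag."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)
    return model

def save_quality_model(model, path="models/review_quality_classifier.pkl"):
    """Pickle the fitted model to the path Step 4 reads."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
```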
Step 4: Score Review Quality
What: Apply the trained model to all 4.4M reviews.
Input:
- Trained quality model
- Clean Parquet tables
Output:
- review_quality_scores.parquet (per review)
- product_quality_scores.parquet (aggregated)
Time: ~15 minutes
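The aggregated table rolls per-review scores up to products. A mean score plus a review count is one plausible aggregation; the actual columns of product_quality_scores.parquet may differ:

```python
import pandas as pd

def aggregate_quality(review_scores: pd.DataFrame) -> pd.DataFrame:
    """Roll per-review quality scores up to one row per product."""
    return (
        review_scores.groupby("product_id")["quality_score"]
        .agg(mean_quality="mean", n_reviews="count")
        .reset_index()
    )
```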
Step 5: Score Sentiment
What: Run sentiment analysis on all review text using DistilBERT.
Input: reviews.parquet (text column)
Output: review_sentiment_scores.parquet
Time: ~2-4 hours (GPU recommended)
Steps 5 and 6 can run in parallel with Step 4, since they only need the clean tables, not the quality scores.
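Scoring 4.4M reviews with DistilBERT is only practical in batches. The model invocation itself isn't shown here; this sketch is just the batching skeleton, where `scorer` would wrap the actual DistilBERT inference (e.g. a Hugging Face `transformers` sentiment pipeline):

```python
from typing import Callable, List

def score_in_batches(texts: List[str],
                     scorer: Callable[[List[str]], List[float]],
                     batch_size: int = 64) -> List[float]:
    """Run a sentiment scorer over review text in fixed-size batches.

    Batching keeps GPU memory bounded regardless of corpus size.
    """
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        scores.extend(scorer(texts[start:start + batch_size]))
    return scores
```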
Step 6: Train Fake Detection Model
What: Learn to detect fake/incentivized reviews.
Input: All clean Parquet tables
Output:
- models/fake_detection/traditional_best.pkl
- models/fake_detection/bert/ (if using BERT)
Time: ~30 minutes (traditional) to hours (BERT, needs GPU)
Optional: If you don't have a GPU, skip BERT training. The traditional model alone still works.
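The "traditional" model typically works from hand-crafted features rather than raw text. The specific features below are illustrative guesses at what such a detector might use, not the repo's actual feature set:

```python
def fake_detection_features(review: dict) -> dict:
    """Extract simple tabular features from one review record."""
    text = review.get("text", "")
    rating = review.get("rating", 0)
    return {
        "text_length": len(text),
        "exclamation_count": text.count("!"),
        # 1- and 5-star reviews are often over-represented among fakes.
        "is_extreme_rating": int(rating in (1, 5)),
        "incentivized_flag": int(bool(review.get("incentivized", False))),
    }
```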
Step 7: Score Fake Detection
What: Apply fake detection to all reviews.
Input:
- Trained fake detection models
- All Parquet tables
Output: review_fake_scores.parquet
Columns:
- fake_prob_traditional
- fake_prob_bert (if available)
- fake_prob_ensemble
Time: ~20 minutes (traditional) to hours (with BERT)
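One way the `fake_prob_ensemble` column could be derived from the other two, falling back to the traditional score when BERT wasn't run. The 50/50 weighting is an assumption, not the pipeline's documented formula:

```python
from typing import Optional

def ensemble_fake_prob(p_traditional: float,
                       p_bert: Optional[float],
                       bert_weight: float = 0.5) -> float:
    """Blend the two detectors; fall back to traditional-only."""
    if p_bert is None:
        return p_traditional
    return (1 - bert_weight) * p_traditional + bert_weight * p_bert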
Step 8: Run Product Finder
What: Calculate Love Scores using all signals.
Input:
- All Parquet tables
- All score tables (quality, sentiment, fake)
- Sitemap with product URLs
Output: products_to_scrape.jsonl
Time: ~5 minutes
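The Love Score formula itself isn't documented here. As a purely illustrative sketch, it might blend the three per-product signals like this, with all inputs assumed to be in [0, 1] and the weights made up:

```python
def love_score(mean_sentiment: float,
               mean_quality: float,
               mean_fake_prob: float,
               weights=(0.5, 0.3, 0.2)) -> float:
    """Blend sentiment, quality, and authenticity into one score."""
    w_sent, w_qual, w_auth = weights
    # Invert fake probability so higher means more trustworthy.
    authenticity = 1.0 - mean_fake_prob
    return w_sent * mean_sentiment + w_qual * mean_quality + w_auth * authenticity
```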
Step 9: Run Product Prioritizer
What: Calculate tier scores for detail scraping.
Input:
- reviews.parquet
- Sitemap
Output: products_for_details.jsonl
Time: ~2 minutes
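Tier assignment by review volume could be as simple as the sketch below. The thresholds are invented for illustration; the real prioritizer may use more signals than review count:

```python
def assign_tier(review_count: int) -> int:
    """Bucket a product into a detail-scraping priority tier."""
    if review_count >= 1000:
        return 1  # scrape details first
    if review_count >= 100:
        return 2
    return 3
```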
Step 10: Generate Reports
What: Create analytics reports from the database.
Input: DuckDB database (or Parquet tables)
Output:
- sephora_analytics_report_{timestamp}.md
- sephora_analytics_data_{timestamp}.json
Time: ~1 minute
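Building the timestamped filenames is straightforward; the `%Y%m%d_%H%M%S` format is an assumption about what `{timestamp}` expands to:

```python
from datetime import datetime
from typing import Optional, Tuple

def report_paths(now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return the (markdown, json) report filenames for a run."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return (f"sephora_analytics_report_{stamp}.md",
            f"sephora_analytics_data_{stamp}.json")
```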
Quick Reference
| Step | Name | Depends On | Output |
|---|---|---|---|
| 1 | Collection | (start) | Raw JSONL |
| 2 | Cleaning | 1 | Parquet tables |
| 3 | Train Quality | 2 | quality model |
| 4 | Score Quality | 3 | quality scores |
| 5 | Score Sentiment | 2 | sentiment scores |
| 6 | Train Fake | 2 | fake models |
| 7 | Score Fake | 6 | fake scores |
| 8 | Product Finder | 4, 5, 7 | ranked products |
| 9 | Product Prioritizer | 2 | tiered products |
| 10 | Reports | 2 | analytics |
Minimal Run
If you just want rankings without full ML:
- Collection
- Cleaning
- Train Quality
- Score Quality
- Product Finder (uses quality only)
This skips fake detection and sentiment, but still gives useful rankings.
What's Next?
See what's not yet implemented.
Next: What's Missing → covers the gaps still to fill.