
How to Run the Pipeline

Here's the order to execute everything. Most steps depend on the outputs of earlier ones, though a few can run in parallel.


Execution Sequence

Step 1 ──► Step 2 ──► Step 3 ──► Step 4 ──┐
              │                           │
              ├──► Step 5 (parallel) ─────┼──► Step 8 ──► Step 9 ──► Step 10
              │                           │
              └──► Step 6 ──► Step 7 ─────┘

Step 1: Data Collection

What: Scrape reviews from Sephora's BazaarVoice API

Input: Sephora sitemap (product URLs)

Output: Raw JSONL files (~2 GB compressed)

Time: Several days (rate limited)

⚠️ Rate Limits

The API allows ~1 request every 2 seconds. Don't run this unless you need fresh data.
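A minimal sketch of how the scraper can respect that budget. The class below is illustrative (the real collector's structure isn't shown here); it just enforces a minimum interval between calls:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between API calls (~1 request / 2 s)."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep until the next request is allowed; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Call `limiter.wait()` immediately before each API request; the first call returns without sleeping.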


Step 2: Data Cleaning

What: Transform raw files into organized tables. Deduplicate and normalize.

Input: Raw JSONL files

Output: 6 Parquet tables:

  • reviews.parquet
  • user_profiles.parquet
  • review_engagement.parquet
  • review_photos.parquet
  • users.parquet
  • review_metadata.parquet

Time: ~30 minutes
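The dedup/normalize pass can be sketched in plain Python. Field names (`review_id`, `rating`, `text`) are assumptions about the raw schema, and the real step writes Parquet (e.g. via pandas/pyarrow) rather than returning a list:

```python
import json

def clean_reviews(raw_lines):
    """Deduplicate raw JSONL records by review id and normalize fields.
    Assumed schema: review_id (str), rating (stringified int), text (str)."""
    seen = {}
    for line in raw_lines:
        rec = json.loads(line)
        rid = rec.get("review_id")
        if rid is None or rid in seen:
            continue  # drop records without an id and exact duplicates
        rec["text"] = (rec.get("text") or "").strip()
        rec["rating"] = int(rec["rating"]) if rec.get("rating") is not None else None
        seen[rid] = rec
    return list(seen.values())
```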


Step 3: Train Review Quality Model

What: Learn what makes a review high-quality using incentivized flag as signal.

Input: Clean Parquet tables

Output: models/review_quality_classifier.pkl

Time: ~10 minutes
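A sketch of the training step, assuming a logistic regression over simple per-review features (both the feature set and the model choice are assumptions; only the output path matches the pipeline):

```python
import pickle
from sklearn.linear_model import LogisticRegression

def train_quality_model(features, labels, path=None):
    """Fit a classifier using the incentivized flag as the label.
    features: rows like [text_length, rating, photo_count] (illustrative);
    labels: 0/1 incentivized flag from the clean tables."""
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    if path:  # e.g. models/review_quality_classifier.pkl
        with open(path, "wb") as f:
            pickle.dump(model, f)
    return model
```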


Step 4: Score Review Quality

What: Apply the trained model to all 4.4M reviews.

Input:

  • Trained quality model
  • Clean Parquet tables

Output:

  • review_quality_scores.parquet (per review)
  • product_quality_scores.parquet (aggregated)

Time: ~15 minutes
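The per-review to per-product split mirrors the two output files: score every review, then aggregate by product. A minimal aggregation sketch (mean per product; the real aggregation may weight or filter differently):

```python
from collections import defaultdict

def aggregate_quality(review_scores):
    """review_scores: iterable of (product_id, quality_score) pairs,
    as in review_quality_scores.parquet. Returns {product_id: mean score},
    the shape of product_quality_scores.parquet."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, score in review_scores:
        sums[pid] += score
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}
```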


Step 5: Score Sentiment

What: Run sentiment analysis on all review text using DistilBERT.

Input: reviews.parquet (text column)

Output: review_sentiment_scores.parquet

Time: ~2-4 hours (GPU recommended)

💡 Parallel Execution

Steps 5-6 can run in parallel with Step 4 since they only need the clean tables, not the quality scores.
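The real step runs DistilBERT (e.g. via Hugging Face transformers) over the text column. As a dependency-free stand-in, this toy lexicon scorer shows only the output contract: one sentiment score in [-1, 1] per review. The word lists are made up for illustration:

```python
POS = {"love", "great", "amazing", "perfect"}
NEG = {"hate", "terrible", "broke", "awful"}

def toy_sentiment(text: str) -> float:
    """Toy polarity score in [-1, 1]; stands in for the DistilBERT output."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POS for w in words)
    neg = sum(w in NEG for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```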


Step 6: Train Fake Detection Model

What: Learn to detect fake/incentivized reviews.

Input: All clean Parquet tables

Output:

  • models/fake_detection/traditional_best.pkl
  • models/fake_detection/bert/ (if using BERT)

Time: ~30 minutes (traditional) to hours (BERT, needs GPU)

Optional: If you don't have a GPU, skip BERT training. Traditional-only still works.
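The traditional model works off hand-crafted signals. A sketch of what such a feature extractor might look like (the specific features here are illustrative assumptions, not the real set):

```python
def fake_features(review):
    """Extract hand-crafted signals a traditional fake-review model might use."""
    text = review.get("text", "")
    words = text.split()
    return {
        "length": len(words),
        "exclamations": text.count("!"),
        "all_caps_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        "incentivized": int(bool(review.get("incentivized"))),
    }
```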


Step 7: Score Fake Detection

What: Apply fake detection to all reviews.

Input:

  • Trained fake detection models
  • All Parquet tables

Output: review_fake_scores.parquet

  • fake_prob_traditional
  • fake_prob_bert (if available)
  • fake_prob_ensemble

Time: ~20 minutes (traditional) to hours (with BERT)
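How the three output columns might relate: the ensemble blends the two model probabilities and falls back to the traditional score when no BERT score exists, matching the "if available" column. The 50/50 weighting is an assumption:

```python
def ensemble_fake_prob(p_traditional, p_bert=None, w_bert=0.5):
    """Combine model probabilities into fake_prob_ensemble.
    Falls back to fake_prob_traditional when fake_prob_bert is absent."""
    if p_bert is None:
        return p_traditional
    return (1 - w_bert) * p_traditional + w_bert * p_bert
```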


Step 8: Run Product Finder

What: Calculate Love Scores using all signals.

Input:

  • All Parquet tables
  • All score tables (quality, sentiment, fake)
  • Sitemap with product URLs

Output: products_to_scrape.jsonl

Time: ~5 minutes
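One plausible shape for "using all signals": a weighted blend where the fake probability acts as a penalty. The weights, the [0, 1] normalization of each input, and the formula itself are assumptions; the real Product Finder may combine signals differently:

```python
def love_score(avg_rating, quality, sentiment, fake_prob,
               weights=(0.4, 0.2, 0.2, 0.2)):
    """Blend the pipeline's signals into one ranking score in [0, 1].
    avg_rating in [1, 5]; quality and sentiment assumed normalized to [0, 1];
    fake_prob in [0, 1], where higher means more likely fake (penalized)."""
    w_rate, w_qual, w_sent, w_fake = weights
    rating01 = (avg_rating - 1) / 4  # rescale stars to [0, 1]
    return (w_rate * rating01 + w_qual * quality
            + w_sent * sentiment + w_fake * (1 - fake_prob))
```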


Step 9: Run Product Prioritizer

What: Calculate tier scores for detail scraping.

Input:

  • reviews.parquet
  • Sitemap

Output: products_for_details.jsonl

Time: ~2 minutes
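A sketch of tier assignment from review counts and ratings. The thresholds are illustrative assumptions; the real prioritizer has its own cutoffs:

```python
def assign_tier(review_count, avg_rating):
    """Bucket a product into a scrape-priority tier (1 = scrape details first)."""
    if review_count >= 500 and avg_rating >= 4.0:
        return 1  # popular and well-rated: highest priority
    if review_count >= 100:
        return 2  # moderately reviewed
    return 3      # long tail
```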


Step 10: Generate Reports

What: Create analytics reports from the database.

Input: DuckDB database (or Parquet tables)

Output:

  • sephora_analytics_report_{timestamp}.md
  • sephora_analytics_data_{timestamp}.json

Time: ~1 minute
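The rendering step can be sketched as: query aggregates (from DuckDB or the Parquet tables), then write the timestamped .md/.json pair. The `stats` keys below are illustrative; the real report has more sections:

```python
import json
import time

def write_reports(stats, outdir="."):
    """Render aggregate stats into the timestamped report pair:
    sephora_analytics_report_{timestamp}.md and
    sephora_analytics_data_{timestamp}.json."""
    ts = time.strftime("%Y%m%d_%H%M%S")
    md = "# Sephora Analytics Report\n\n" + "\n".join(
        f"- {k}: {v}" for k, v in stats.items())
    md_path = f"{outdir}/sephora_analytics_report_{ts}.md"
    json_path = f"{outdir}/sephora_analytics_data_{ts}.json"
    with open(md_path, "w") as f:
        f.write(md)
    with open(json_path, "w") as f:
        json.dump(stats, f, indent=2)
    return md_path, json_path
```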


Quick Reference

Step  Name                 Depends On  Output
1     Collection           (start)     Raw JSONL
2     Cleaning             1           Parquet tables
3     Train Quality        2           quality model
4     Score Quality        3           quality scores
5     Score Sentiment      2           sentiment scores
6     Train Fake           2           fake models
7     Score Fake           6           fake scores
8     Product Finder       4, 5, 7     ranked products
9     Product Prioritizer  2           tiered products
10    Reports              2           analytics

Minimal Run

If you just want rankings without full ML:

  1. Collection (Step 1)
  2. Cleaning (Step 2)
  3. Train Quality (Step 3)
  4. Score Quality (Step 4)
  5. Product Finder (Step 8, using quality scores only)

This skips fake detection and sentiment, but still gives useful rankings.


What's Next?

See what's not yet implemented.

Next: What's Missing → covers the gaps still to fill.