How to Run the Pipeline
Here's the correct order to execute everything. Each step consumes output from one or more earlier steps.
Execution Sequence
Step 1 ──► Step 2 ──► Step 3 ──► Step 4 ──┐
              │                           │
              ├──► Step 5 (parallel) ─────┤
              │                           │
              └──► Step 6 ──► Step 7 ─────┤
                                          ▼
                    Step 8 ──► Step 9 ──► Step 10

Step 1: Data Collection
What: Scrape reviews from Sephora's BazaarVoice API
Input: Sephora sitemap (product URLs)
Output: Raw JSONL files (~2 GB compressed)
Time: Several days (rate limited)
The API allows ~1 request every 2 seconds. Don't run this unless you need fresh data.
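To respect that ~1-request-per-2-seconds limit, the scraper needs some form of throttling. The repo's actual implementation isn't shown here; this is a minimal sketch of a rate limiter you could wrap around each API call (the 2-second default mirrors the note above, but the exact safe interval is an assumption):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_call = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        # Sleep just long enough to keep calls at least
        # `min_interval` seconds apart.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Usage would look like `limiter = RateLimiter(2.0)`, then `limiter.wait()` before each request.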
Step 2: Data Cleaning
What: Transform raw files into organized tables. Deduplicate and normalize.
Input: Raw JSONL files
Output: 6 Parquet tables:
- reviews.parquet
- user_profiles.parquet
- review_engagement.parquet
- review_photos.parquet
- users.parquet
- review_metadata.parquet
Time: ~30 minutes
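The deduplicate-and-normalize pass can be sketched as a small pandas transform. The column names (`review_id`, `text`) are illustrative guesses at the raw JSONL schema, not the pipeline's actual field names:

```python
import pandas as pd

def clean_reviews(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate and normalize raw review records."""
    df = raw.copy()
    # Normalize text: trim edges, collapse internal whitespace runs.
    df["text"] = df["text"].str.strip().str.replace(r"\s+", " ", regex=True)
    # Drop duplicate reviews by id, keeping the first occurrence.
    df = df.drop_duplicates(subset="review_id", keep="first")
    return df.reset_index(drop=True)
```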
Step 3: Train Review Quality Model
What: Learn what makes a review high-quality using incentivized flag as signal.
Input: Clean Parquet tables
Output: models/review_quality_classifier.pkl
Time: ~10 minutes
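In outline, training looks like the sketch below. The incentivized flag as the label comes from the description above; the TF-IDF + logistic regression architecture is a stand-in, since the real model behind `review_quality_classifier.pkl` isn't specified here:

```python
import os
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_quality_model(texts, labels):
    """Fit a text classifier; `labels` is the incentivized flag."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)
    return model

def save_quality_model(model, path="models/review_quality_classifier.pkl"):
    """Pickle the fitted model to the path Step 4 reads."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
```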
Step 4: Score Review Quality
What: Apply the trained model to all 4.4M reviews.
Input:
- Trained quality model
- Clean Parquet tables
Output:
- review_quality_scores.parquet (per review)
- product_quality_scores.parquet (aggregated)
Time: ~15 minutes
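The aggregated table rolls per-review scores up to products. A mean score plus a review count is one plausible aggregation; the actual columns of product_quality_scores.parquet may differ:

```python
import pandas as pd

def aggregate_quality(review_scores: pd.DataFrame) -> pd.DataFrame:
    """Roll per-review quality scores up to one row per product."""
    return (
        review_scores.groupby("product_id")["quality_score"]
        .agg(mean_quality="mean", n_reviews="count")
        .reset_index()
    )
```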
Step 5: Score Sentiment
What: Run sentiment analysis on all review text using DistilBERT.
Input: reviews.parquet (text column)
Output: review_sentiment_scores.parquet
Time: ~2-4 hours (GPU recommended)
Steps 5 and 6 can run in parallel with Step 4, since they only need the clean tables, not the quality scores.
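Scoring 4.4M reviews with DistilBERT is only practical in batches. The model invocation itself isn't shown here; this sketch is just the batching skeleton, where `scorer` would wrap the actual DistilBERT inference (e.g. a Hugging Face `transformers` sentiment pipeline):

```python
from typing import Callable, List

def score_in_batches(texts: List[str],
                     scorer: Callable[[List[str]], List[float]],
                     batch_size: int = 64) -> List[float]:
    """Run a sentiment scorer over review text in fixed-size batches.

    Batching keeps GPU memory bounded regardless of corpus size.
    """
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        scores.extend(scorer(texts[start:start + batch_size]))
    return scores
```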
Step 6: Train Fake Detection Model
What: Learn to detect fake/incentivized reviews.
Input: All clean Parquet tables
Output:
- models/fake_detection/traditional_best.pkl
- models/fake_detection/bert/ (if using BERT)
Time: ~30 minutes (traditional) to hours (BERT, needs GPU)
Optional: If you don't have a GPU, skip BERT training. The traditional model alone still works.
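The "traditional" model typically works from hand-crafted features rather than raw text. The specific features below are illustrative guesses at what such a detector might use, not the repo's actual feature set:

```python
def fake_detection_features(review: dict) -> dict:
    """Extract simple tabular features from one review record."""
    text = review.get("text", "")
    rating = review.get("rating", 0)
    return {
        "text_length": len(text),
        "exclamation_count": text.count("!"),
        # 1- and 5-star reviews are often over-represented among fakes.
        "is_extreme_rating": int(rating in (1, 5)),
        "incentivized_flag": int(bool(review.get("incentivized", False))),
    }
```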
Step 7: Score Fake Detection
What: Apply fake detection to all reviews.
Input:
- Trained fake detection models
- All Parquet tables
Output: review_fake_scores.parquet
Columns:
- fake_prob_traditional
- fake_prob_bert (if available)
- fake_prob_ensemble
Time: ~20 minutes (traditional) to hours (with BERT)
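One way the `fake_prob_ensemble` column could be derived from the other two, falling back to the traditional score when BERT wasn't run. The 50/50 weighting is an assumption, not the pipeline's documented formula:

```python
from typing import Optional

def ensemble_fake_prob(p_traditional: float,
                       p_bert: Optional[float],
                       bert_weight: float = 0.5) -> float:
    """Blend the two detectors; fall back to traditional-only."""
    if p_bert is None:
        return p_traditional
    return (1 - bert_weight) * p_traditional + bert_weight * p_bert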
Step 8: Run Product Finder
What: Calculate Love Scores using all signals.
Input:
- All Parquet tables
- All score tables (quality, sentiment, fake)
- Sitemap with product URLs
Output: products_to_scrape.jsonl
Time: ~5 minutes
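The Love Score formula itself isn't documented here. As a purely illustrative sketch, it might blend the three per-product signals like this, with all inputs assumed to be in [0, 1] and the weights made up:

```python
def love_score(mean_sentiment: float,
               mean_quality: float,
               mean_fake_prob: float,
               weights=(0.5, 0.3, 0.2)) -> float:
    """Blend sentiment, quality, and authenticity into one score."""
    w_sent, w_qual, w_auth = weights
    # Invert fake probability so higher means more trustworthy.
    authenticity = 1.0 - mean_fake_prob
    return w_sent * mean_sentiment + w_qual * mean_quality + w_auth * authenticity
```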
Step 9: Run Product Prioritizer
What: Calculate tier scores for detail scraping.
Input:
- reviews.parquet
- Sitemap
Output: products_for_details.jsonl
Time: ~2 minutes
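Tier assignment by review volume could be as simple as the sketch below. The thresholds are invented for illustration; the real prioritizer may use more signals than review count:

```python
def assign_tier(review_count: int) -> int:
    """Bucket a product into a detail-scraping priority tier."""
    if review_count >= 1000:
        return 1  # scrape details first
    if review_count >= 100:
        return 2
    return 3
```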
Step 10: Generate Reports
What: Create analytics reports from the database.
Input: DuckDB database (or Parquet tables)
Output:
- sephora_analytics_report_{timestamp}.md
- sephora_analytics_data_{timestamp}.json
Time: ~1 minute
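Building the timestamped filenames is straightforward; the `%Y%m%d_%H%M%S` format is an assumption about what `{timestamp}` expands to:

```python
from datetime import datetime
from typing import Optional, Tuple

def report_paths(now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return the (markdown, json) report filenames for a run."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return (f"sephora_analytics_report_{stamp}.md",
            f"sephora_analytics_data_{stamp}.json")
```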
Quick Reference
| Step | Name | Depends On | Output |
|---|---|---|---|
| 1 | Collection | (start) | Raw JSONL |
| 2 | Cleaning | 1 | Parquet tables |
| 3 | Train Quality | 2 | quality model |
| 4 | Score Quality | 3 | quality scores |
| 5 | Score Sentiment | 2 | sentiment scores |
| 6 | Train Fake | 2 | fake models |
| 7 | Score Fake | 6 | fake scores |
| 8 | Product Finder | 4, 5, 7 | ranked products |
| 9 | Product Prioritizer | 2 | tiered products |
| 10 | Reports | 2 | analytics |
Minimal Run
If you just want rankings without full ML:
- Collection
- Cleaning
- Train Quality
- Score Quality
- Product Finder (uses quality only)
This skips fake detection and sentiment, but still gives useful rankings.
What's Next?
See what's not yet implemented.
Next: What's Missing → covers the gaps still to fill.