Outputs
The pipeline produces three main outputs. Here's what each contains and how to use it.
Output 1: Ranked Products (Full Detail)
A list of every ranked product, with the full breakdown behind each score.
What's Included
For each product:
| Field | Description |
|---|---|
| product_id | Unique identifier |
| url | Sephora product page |
| love_score | Final score (0-1) |
| priority_strategy | How it was ranked (love, volume, etc.) |
Score Components:
| Field | Description |
|---|---|
| organic_quality | 0-1 |
| engagement_quality | 0-1 |
| authenticity | 0-1 |
| diversity | 0-1 |
| trend | 0-1 |
Raw Metrics:
| Field | Description |
|---|---|
| review_count | Total reviews |
| avg_rating | Overall average |
| organic_avg_rating | Unpaid reviewers only |
| organic_ratio | % genuine reviews |
| pct_negative | % 1-2 star reviews |
Adjustments:
| Field | Description |
|---|---|
| triggered_adjustments | Which penalties/boosts fired |
| score_explanation | Human-readable summary |
Example Entry
{
  "product_id": "P12345",
  "url": "https://sephora.com/product/...",
  "love_score": 0.82,
  "score_components": {
    "organic_quality": 0.85,
    "engagement_quality": 0.72,
    "authenticity": 0.88,
    "diversity": 0.65,
    "trend": 0.58
  },
  "triggered_adjustments": ["power_user_boost"],
  "score_explanation": "Strong organic quality (0.85), boosted by power user endorsement (+4%)"
}
Use Case
Use this output when you want to understand why a product ranks where it does. Every score component and adjustment is exposed for analysis.
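As a rough illustration of that kind of analysis, here is a minimal Python sketch that turns one record into a one-line breakdown. It assumes each line of the ranked output is a JSON object shaped like the example entry above; the helper name `describe_entry` is hypothetical and not part of the pipeline.

```python
import json


def describe_entry(entry: dict) -> str:
    """Format one ranked-product record as a short, readable breakdown."""
    components = entry.get("score_components", {})
    # Sort components from strongest to weakest so the main drivers come first.
    ranked = sorted(components.items(), key=lambda kv: kv[1], reverse=True)
    parts = ", ".join(f"{name}={value:.2f}" for name, value in ranked)
    adjustments = ", ".join(entry.get("triggered_adjustments", [])) or "none"
    return (
        f"{entry['product_id']}: love_score={entry['love_score']:.2f} "
        f"[{parts}] adjustments: {adjustments}"
    )


# One JSONL line, using the values from the example entry above.
line = (
    '{"product_id": "P12345", "love_score": 0.82, '
    '"score_components": {"organic_quality": 0.85, "engagement_quality": 0.72, '
    '"authenticity": 0.88, "diversity": 0.65, "trend": 0.58}, '
    '"triggered_adjustments": ["power_user_boost"]}'
)
entry = json.loads(line)
summary = describe_entry(entry)
print(summary)
```

Sorting the components makes it easy to see at a glance which signals carry a product's score and which ones hold it back.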
Output 2: Products Needing Details (Tiered List)
A prioritized list for the next phase: scraping product pages for ingredients, prices, and descriptions.
What's Included
| Field | Description |
|---|---|
| product_id | Unique identifier |
| url | Sephora product page |
| priority_score | 0-100 points |
| priority_tier | High / Medium / Low |
| review_count | Total reviews |
| avg_rating | Overall average |
The Tiers
| Tier | Score | Count | Action |
|---|---|---|---|
| High | 70+ | ~2,000 | Scrape immediately |
| Medium | 40-69 | ~4,000 | Scrape eventually |
| Low | 15-39 | ~3,000 | Skip for now |
| Skip | <15 | ~1,000 | Ignore |
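The thresholds in the table can be expressed as a small lookup. This is a sketch based only on the score ranges listed above; the function name `priority_tier` is hypothetical, and treating fractional scores with `>=` cutoffs (so 69.5 counts as Medium) is an assumption, since the table only gives whole-number ranges.

```python
def priority_tier(priority_score: float) -> str:
    """Map a 0-100 priority score to the tier names used in the table."""
    if priority_score >= 70:
        return "High"    # scrape immediately
    if priority_score >= 40:
        return "Medium"  # scrape eventually
    if priority_score >= 15:
        return "Low"     # skip for now
    return "Skip"        # ignore


for score in (85, 55, 20, 5):
    print(score, priority_tier(score))
```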
Use Case
Feed this to the product page scraper to get:
- Current prices
- Full ingredient lists
- Product descriptions
- How-to-use instructions
- Size/volume information
The ranked products list is for analysis. The tiered list is for action — it tells the scraper what to prioritize next.
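A scraper consuming the tiered list might start by selecting the High tier and ordering it by score. The sketch below assumes JSON Lines input with the fields listed under What's Included; `build_scrape_queue` is a hypothetical helper, and the sample records and `example.com` URLs are made up for illustration.

```python
import io
import json


def build_scrape_queue(lines, tiers=("High",)):
    """Return records in the requested tiers, highest priority_score first."""
    records = (json.loads(line) for line in lines if line.strip())
    selected = [r for r in records if r["priority_tier"] in tiers]
    return sorted(selected, key=lambda r: r["priority_score"], reverse=True)


# Stand-in for the tiered JSONL file; in practice this would be an open file handle.
sample = io.StringIO(
    '{"product_id": "P1", "url": "https://example.com/p1", "priority_score": 82, "priority_tier": "High"}\n'
    '{"product_id": "P2", "url": "https://example.com/p2", "priority_score": 55, "priority_tier": "Medium"}\n'
    '{"product_id": "P3", "url": "https://example.com/p3", "priority_score": 91, "priority_tier": "High"}\n'
)
queue = build_scrape_queue(sample)
urls = [record["url"] for record in queue]
print(urls)
```

Passing `tiers=("High", "Medium")` would widen the queue once the High tier has been scraped.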
Output 3: Analytics Reports
Periodic reports that summarize the dataset at a macro level.
Report Contents
Executive Summary
- Total reviews, products, users
- Date range covered
- Key statistics
Rating Distribution
- 5-star: 64.3%
- 4-star: 18.7%
- 3-star: 7.5%
- 2-star: 4.2%
- 1-star: 5.3%
Demographic Breakdown
- Reviews by skin type
- Reviews by skin tone
- Reviews by age (where available)
Engagement Insights
- Average helpful votes
- Photo inclusion rate
- Substantive review percentage
Top Products
- Highest Love Score
- Most reviewed
- Best organic rating
Polarizing Products
- High variance + many negatives
- Love-it-or-hate-it patterns
Temporal Trends
- Reviews by month
- Reviews by year
- Seasonal patterns
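The Rating Distribution figures are percentages of all reviews. A minimal sketch of how such a breakdown could be computed from raw star ratings, assuming integer ratings from 1 to 5; the function name and the sample ratings are hypothetical and not taken from the pipeline.

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {stars: percent of reviews} for 5 down to 1, rounded to one decimal."""
    counts = Counter(ratings)
    total = len(ratings)
    if total == 0:
        return {stars: 0.0 for stars in range(5, 0, -1)}
    return {stars: round(100 * counts[stars] / total, 1) for stars in range(5, 0, -1)}


ratings = [5, 5, 5, 4, 4, 3, 2, 1, 5, 5]  # made-up sample, not real data
dist = rating_distribution(ratings)
negative_share = dist[1] + dist[2]  # combined 1-2 star share
print(dist, negative_share)
```

Applied to the figures listed above, the combined 1-2 star share is 4.2% + 5.3% = 9.5%.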
Formats
Reports are generated in two formats:
- Markdown — Human-readable, good for documentation
- JSON — Machine-readable, good for dashboards
Use Case
Use the reports to understand the dataset at a high level, spot trends, and prepare presentations.
File Locations
| Output | Location |
|---|---|
| Ranked Products | data/products/products_to_scrape.jsonl |
| Tiered Products | data/products/products_for_details.jsonl |
| Analytics Report | analysis/reports/sephora_analytics_report_{timestamp}.md |
| Analytics Data | analysis/reports/sephora_analytics_data_{timestamp}.json |
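Because report filenames include a timestamp, a consumer usually wants the most recent one. The format of `{timestamp}` is not specified here, so this sketch picks the newest file by modification time instead of parsing the name; `latest_file` is a hypothetical helper.

```python
from pathlib import Path


def latest_file(reports_dir, pattern):
    """Return the most recently modified file matching pattern, or None if there is none."""
    candidates = list(Path(reports_dir).glob(pattern))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Paths follow the File Locations table; both results are None if no report exists yet.
report = latest_file("analysis/reports", "sephora_analytics_report_*.md")
data = latest_file("analysis/reports", "sephora_analytics_data_*.json")
print(report, data)
```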
What's Next?
Now you know what the pipeline produces. Next, learn how to run it.
Next: How to Run → The execution sequence.