What's Missing
The pipeline is functional but incomplete. Here are the gaps and next steps.
Gap 1: No Ingredients Data
The Problem
We know which products are loved, but not why they are loved from a formulation perspective. The review data doesn't include ingredient lists.
What We Need
Scrape Sephora product pages to get:
- Full ingredients list
- Product description
- How-to-use instructions
- Size/volume
Next Step
Use the `products_for_details.jsonl` output (the tiered product list) to prioritize scraping, starting with high-priority products.
Once we have ingredients, we can correlate them with review sentiment. Which ingredients appear in loved products? Which correlate with negative reviews?
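A minimal sketch of how the tiered list could drive scraping order. The schema of `products_for_details.jsonl` isn't shown here, so the `product_id` and `tier` fields and the tier labels below are assumptions for illustration only; adjust them to match the real file.

```python
import json
from io import StringIO

# Hypothetical tier labels; the real file may use different names or numbers.
TIER_ORDER = {"high": 0, "medium": 1, "low": 2}


def load_scrape_queue(lines):
    """Return product records sorted so high-priority tiers come first."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Unknown or missing tiers sort last. sorted() is stable, so the
    # original file order is preserved within each tier.
    return sorted(
        records,
        key=lambda r: TIER_ORDER.get(r.get("tier"), len(TIER_ORDER)),
    )


# In-memory stand-in for open("products_for_details.jsonl").
sample = StringIO(
    '{"product_id": "P1", "tier": "low"}\n'
    '{"product_id": "P2", "tier": "high"}\n'
    '{"product_id": "P3", "tier": "medium"}\n'
)
queue = load_scrape_queue(sample)
print([r["product_id"] for r in queue])  # ['P2', 'P3', 'P1']
```

Because the function accepts any iterable of lines, the same code works on an open file handle.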
Gap 2: No Pricing Information
The Problem
We can't calculate value-for-money without knowing prices. A $200 moisturizer might have the same Love Score as a $30 one.
What We Need
- Current prices
- Price history (if available)
- Size information for price-per-ounce
Next Step
Add price extraction to the product page scraper. Sephora shows price on every product page.
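Once price and size are both scraped, price-per-ounce is a small calculation. The sketch below assumes the size arrives as free text such as `"1.7 oz / 50 mL"`; that format and the function name `price_per_ounce` are assumptions, not something the pipeline defines. Note that converting mL to ounces is only meaningful for products sold by volume.

```python
import re

ML_PER_FL_OZ = 29.5735  # millilitres in one US fluid ounce


def price_per_ounce(price, size_text):
    """Return price per ounce, or None if no oz or mL size can be parsed."""
    oz = re.search(r"(\d+(?:\.\d+)?)\s*(?:fl\.?\s*)?oz", size_text, re.IGNORECASE)
    if oz:
        ounces = float(oz.group(1))
    else:
        # Fall back to millilitres and convert to fluid ounces.
        ml = re.search(r"(\d+(?:\.\d+)?)\s*ml", size_text, re.IGNORECASE)
        if not ml:
            return None
        ounces = float(ml.group(1)) / ML_PER_FL_OZ
    if ounces <= 0:
        return None
    return round(price / ounces, 2)


print(price_per_ounce(200.0, "1.7 oz / 50 mL"))  # 117.65
print(price_per_ounce(30.0, "50 mL"))            # 17.74
print(price_per_ounce(30.0, "one size"))         # None
```

With this in place, two products with the same Love Score can be compared on cost per ounce rather than sticker price.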
Gap 3: No Category Hierarchy
The Problem
Products aren't fully categorized. We know a product exists, but not its place in the taxonomy.
Example: Is this a...
- Skincare > Moisturizers > Day Creams
- Or Skincare > Treatments > Serums?
What We Need
Full category path from Sephora's navigation structure.
Next Step
Extract breadcrumbs from product pages during detail scraping.
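Breadcrumb extraction could look roughly like the sketch below, which uses only the standard library. The markup it expects (`<nav aria-label="Breadcrumb">` wrapping a series of links) is a hypothetical stand-in; Sephora's actual page structure has to be inspected first and the selector adjusted accordingly.

```python
from html.parser import HTMLParser


class BreadcrumbParser(HTMLParser):
    """Collect link text inside <nav aria-label="Breadcrumb"> (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.in_link = False
        self.path = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav" and dict(attrs).get("aria-label") == "Breadcrumb":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            self.in_link = True
            self.path.append("")  # start a new path segment

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.path[-1] += data


def category_path(html):
    """Return the category path as a list, e.g. ['Skincare', 'Moisturizers']."""
    parser = BreadcrumbParser()
    parser.feed(html)
    return [part.strip() for part in parser.path if part.strip()]


sample_html = (
    '<nav aria-label="Breadcrumb">'
    '<a href="/skincare">Skincare</a>'
    '<a href="/moisturizers">Moisturizers</a>'
    '<a href="/day-creams">Day Creams</a>'
    "</nav>"
    '<a href="/cart">Cart</a>'
)
print(" > ".join(category_path(sample_html)))  # Skincare > Moisturizers > Day Creams
```

Storing the path as a list rather than a joined string keeps each taxonomy level available for filtering later.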
Gap 4: Ingredient Intelligence
The Problem
Even once we have ingredient lists, they're just text. We need to understand them.
What We Need
- Parse ingredient lists into structured data
  - Separate each ingredient
  - Normalize names (Vitamin C = Ascorbic Acid)
- Identify ingredient types
  - Active ingredients vs. fillers
  - Preservatives, fragrances, emulsifiers
- Cluster by formulation
  - Group similar products
  - Identify formulation patterns
- Correlate with sentiment
  - Which ingredients appear in loved products?
  - Which trigger negative reviews?
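The first item, parsing and normalizing, can be sketched with plain string handling before any model is involved. The alias table below is a tiny hypothetical example (only the Vitamin C entry comes from this page); a real one would come from an ingredient database. Splitting on commas is also a simplification, since names that contain commas (such as 1,2-Hexanediol) would be split incorrectly.

```python
# Hypothetical alias table mapping common names to one canonical name.
ALIASES = {
    "vitamin c": "ascorbic acid",
    "vitamin e": "tocopherol",
    "aqua": "water",
}


def parse_ingredients(raw):
    """Split a comma-separated ingredient string into normalized, unique names."""
    names = []
    for part in raw.split(","):
        # Collapse internal whitespace, lowercase, and drop a trailing period.
        name = " ".join(part.split()).lower().rstrip(".")
        if not name:
            continue
        name = ALIASES.get(name, name)
        if name not in names:  # keep label order, skip duplicates
            names.append(name)
    return names


print(parse_ingredients("Aqua, Glycerin,  Vitamin C, Ascorbic Acid."))
# ['water', 'glycerin', 'ascorbic acid']
```

Order is preserved because ingredient labels are conventionally listed by concentration, which may matter for later formulation analysis.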
Next Step
This is Phase 2 of the project. It requires:
- NER model for ingredient extraction
- Ingredient database for classification
- Analysis pipeline for correlations
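As a rough idea of what the simplest correlation pass might compute, the sketch below compares the mean score of products that contain an ingredient with the mean score of products that don't. The input shape, the function name, and the sample scores are all made up for illustration, and a gap like this shows association only, not that an ingredient causes good or bad reviews.

```python
from collections import defaultdict
from statistics import mean


def ingredient_score_gaps(products, min_products=2):
    """Map each ingredient to (mean score with it) - (mean score without it).

    products: list of (ingredient_names, score) pairs.
    Ingredients seen in fewer than min_products products, or in every
    product, are skipped because no comparison is possible.
    """
    with_scores = defaultdict(list)
    for ingredients, score in products:
        for name in set(ingredients):
            with_scores[name].append(score)

    gaps = {}
    for name, scores in with_scores.items():
        without = [s for ings, s in products if name not in ings]
        if len(scores) >= min_products and without:
            gaps[name] = round(mean(scores) - mean(without), 2)
    return gaps


# Illustrative data only; these are not real products or scores.
products = [
    (["water", "glycerin", "niacinamide"], 90),
    (["water", "niacinamide"], 80),
    (["water", "fragrance"], 50),
    (["water", "glycerin", "fragrance"], 60),
]
gaps = ingredient_score_gaps(products)
print(sorted(gaps.items(), key=lambda kv: kv[1], reverse=True))
```

Water is absent from the result because it appears in every sample product, so there is no "without" group to compare against.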
Gap 5: Real-Time Updates
The Problem
The data is a snapshot. New reviews are posted daily.
What We Need
- Incremental scraping (only new reviews)
- Score recalculation on schedule
- Trend detection for rising/falling products
Next Step
Build a scheduler that:
- Runs collection weekly
- Appends new reviews to tables
- Recalculates scores
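The append step depends on telling new reviews from ones already stored. A minimal sketch, assuming each review carries a unique `review_id` field (a hypothetical name; use whatever identifier the review data actually provides):

```python
def new_reviews_only(existing_ids, fetched):
    """Return fetched reviews whose IDs are not already stored.

    Also drops duplicates within the fetched batch itself.
    """
    seen = set(existing_ids)
    fresh = []
    for review in fetched:
        rid = review["review_id"]
        if rid not in seen:
            seen.add(rid)
            fresh.append(review)
    return fresh


stored_ids = ["r1", "r2"]
batch = [
    {"review_id": "r2", "text": "already stored"},
    {"review_id": "r3", "text": "new"},
    {"review_id": "r3", "text": "duplicate within batch"},
]
fresh = new_reviews_only(stored_ids, batch)
print([r["review_id"] for r in fresh])  # ['r3']
```

Filtering by ID makes the weekly run safe to repeat: running it twice on the same batch appends nothing the second time.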
Gap 6: User Interface
The Problem
The outputs are JSON files, which are not user-friendly for exploration.
What We Need
- Searchable product database
- Filters by category, price, skin type
- Comparison views
- Explanation tooltips
Next Step
Build a simple web app or Notion database for exploration.
Priority Order
| Gap | Impact | Effort | Priority |
|---|---|---|---|
| Ingredients | High | Medium | 1st |
| Pricing | High | Low | 2nd |
| Categories | Medium | Low | 3rd |
| Ingredient Intelligence | Very High | High | 4th |
| Real-Time | Medium | Medium | 5th |
| User Interface | Medium | Medium | 6th |
The core pipeline is done. These gaps are enhancements, not blockers. The Love Score already works without them.
What's Next?
If you're unfamiliar with any terms used in this documentation, check the glossary.
Next: Glossary (plain-English definitions).