What's Missing

The pipeline is functional but incomplete. Here are the gaps and next steps.

Gap 1: No Ingredients Data

The Problem

We know WHAT products are loved, but not WHY from a formulation perspective. The review data doesn't include ingredient lists.

What We Need

Scrape Sephora product pages to get:

Full ingredients list
Product description
How-to-use instructions
Size/volume

Next Step

Use the products_for_details.jsonl output (tiered list) to prioritize scraping. High-priority products first.
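
A minimal sketch of that prioritization follows. It assumes each line of products_for_details.jsonl is a JSON object with `product_id` and `tier` fields, and that a lower tier number means higher priority. Both field names are assumptions for illustration; the real file may use different keys.

```python
import json
from pathlib import Path


def load_scrape_queue(lines):
    """Parse JSONL lines and return products ordered by tier, lowest first.

    `tier` and `product_id` are assumed field names, not confirmed keys.
    """
    products = [json.loads(line) for line in lines if line.strip()]
    # sorted() is stable, so products keep their file order within a tier.
    return sorted(products, key=lambda p: p["tier"])


def load_scrape_queue_from_file(path="products_for_details.jsonl"):
    """Read the tiered list from disk and return the ordered scrape queue."""
    with Path(path).open(encoding="utf-8") as f:
        return load_scrape_queue(f)


if __name__ == "__main__":
    sample = [
        '{"product_id": "P2", "tier": 2}',
        '{"product_id": "P1", "tier": 1}',
        '{"product_id": "P3", "tier": 1}',
    ]
    for product in load_scrape_queue(sample):
        print(product["product_id"], product["tier"])
```

The scraper can then walk the returned list front to back and stop whenever its request budget runs out, so the high-priority tier is always covered first.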

🔍 The End Goal

Once we have ingredients, we can correlate them with review sentiment. Which ingredients appear in loved products? Which correlate with negative reviews?

Gap 2: No Pricing Information

The Problem

We can't calculate value-for-money without knowing prices. A $200 moisturizer might have the same Love Score as a $30 one.

What We Need

Current prices
Price history (if available)
Size information for price-per-ounce

Next Step

Add price extraction to the product page scraper. Sephora shows the price on every product page.
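
Once price and size are both available, the price-per-ounce figure is simple arithmetic. The sketch below assumes sizes arrive as a number plus a unit of either `"oz"` or `"ml"`, and treats ounces as US fluid ounces. The function name and the unit handling are illustrative choices, not part of the existing pipeline.

```python
ML_PER_FL_OZ = 29.5735  # millilitres in one US fluid ounce


def price_per_ounce(price_usd, size, unit="oz"):
    """Return the price per US fluid ounce; `unit` must be "oz" or "ml"."""
    if size <= 0:
        raise ValueError("size must be positive")
    if unit == "ml":
        size_oz = size / ML_PER_FL_OZ
    elif unit == "oz":
        size_oz = size
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    return price_usd / size_oz


if __name__ == "__main__":
    # Sizes here are made up for illustration.
    print(round(price_per_ounce(200, 1.7), 2))
    print(round(price_per_ounce(30, 50, unit="ml"), 2))
```

Products sold by weight rather than volume would need their own handling, since a weight ounce and a fluid ounce are not interchangeable.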

Gap 3: No Category Hierarchy

The Problem

Products aren't fully categorized. We know a product exists, but not its place in the taxonomy.

Example: a product could plausibly sit under either of these paths, and the current data can't tell us which:

Skincare > Moisturizers > Day Creams
Skincare > Treatments > Serums

What We Need

Full category path from Sephora's navigation structure.

Next Step

Extract breadcrumbs from product pages during detail scraping.
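
How the breadcrumb text is pulled out of the page depends on Sephora's markup, which is not covered here. Assuming the scraper can produce a string such as `Skincare > Moisturizers > Day Creams`, a sketch of turning it into a structured category path might look like this. The function names and output keys are hypothetical.

```python
def parse_category_path(breadcrumb_text, separator=">"):
    """Split breadcrumb text into a list of category levels, top level first."""
    levels = [part.strip() for part in breadcrumb_text.split(separator)]
    return [level for level in levels if level]


def category_record(breadcrumb_text):
    """Build a small record describing where a product sits in the taxonomy."""
    path = parse_category_path(breadcrumb_text)
    return {
        "category_path": path,
        "top_level": path[0] if path else None,
        "leaf": path[-1] if path else None,
        "depth": len(path),
    }


if __name__ == "__main__":
    print(category_record("Skincare > Moisturizers > Day Creams"))
```

Storing the full path as a list, rather than only the leaf category, keeps it possible to filter at any level of the hierarchy later.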

Gap 4: Ingredient Intelligence

The Problem

Even once we have ingredient lists, they're just text. We need to understand them.

What We Need

Parse ingredient lists into structured data
- Separate each ingredient
- Normalize names (Vitamin C = Ascorbic Acid)
Identify ingredient types
- Active ingredients vs. fillers
- Preservatives, fragrances, emulsifiers
Cluster by formulation
- Group similar products
- Identify formulation patterns
Correlate with sentiment
- Which ingredients appear in loved products?
- Which trigger negative reviews?
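
The first requirement, separating and normalizing ingredients, can be sketched without any model. The example below assumes ingredient lists are comma-separated text and uses a tiny hand-written alias table. In practice the aliases would come from the ingredient database mentioned under the next step, and the function names here are illustrative.

```python
# Hypothetical alias table mapping common names to one canonical name.
ALIASES = {
    "vitamin c": "ascorbic acid",
}


def split_ingredients(raw):
    """Split on commas, ignoring commas inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in raw:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def normalize(name):
    """Lowercase, collapse whitespace, drop a trailing period, apply aliases."""
    key = " ".join(name.lower().split()).rstrip(".")
    return ALIASES.get(key, key)


def parse_ingredients(raw):
    """Return the normalized ingredient names in their original order."""
    return [normalize(part) for part in split_ingredients(raw)]


if __name__ == "__main__":
    print(parse_ingredients("Aqua (Water, Eau), Vitamin C, Glycerin."))
```

Order is preserved on purpose: ingredient lists are conventionally written from highest to lowest concentration, so position may be a useful signal for the clustering step.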

Next Step

This is Phase 2 of the project. It requires:

NER model for ingredient extraction
Ingredient database for classification
Analysis pipeline for correlations
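
As a starting point for the correlation analysis, one simple descriptive measure is the average Love Score of the products that contain each ingredient. The sketch below assumes each product is a dict with an `ingredients` list (already normalized) and a numeric `love_score`. Both key names and the `min_products` threshold are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_ingredient(products, min_products=1):
    """Average the Love Score over the products containing each ingredient."""
    scores = defaultdict(list)
    for product in products:
        # set() so a repeated ingredient counts once per product.
        for ingredient in set(product["ingredients"]):
            scores[ingredient].append(product["love_score"])
    return {
        ingredient: mean(values)
        for ingredient, values in scores.items()
        if len(values) >= min_products
    }


if __name__ == "__main__":
    # Scores are made-up values on an arbitrary scale.
    sample = [
        {"ingredients": ["water", "glycerin"], "love_score": 80},
        {"ingredients": ["water"], "love_score": 40},
    ]
    print(mean_score_by_ingredient(sample))
```

This only shows which ingredients co-occur with loved products. It does not show that an ingredient causes the score, and very common ingredients such as water will trend toward the overall average.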

Gap 5: Real-Time Updates

The Problem

The data is a snapshot. New reviews are posted daily, so scores drift out of date until the next collection run.

What We Need

Incremental scraping (only new reviews)
Score recalculation on schedule
Trend detection for rising/falling products

Next Step

Build a scheduler that:

Runs collection weekly
Appends new reviews to tables
Recalculates scores
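
The append step needs to skip reviews that are already stored, otherwise each weekly run would double-count them. A minimal sketch, assuming every review carries a unique identifier under a `review_id` key (an assumed field name), could look like this:

```python
def merge_new_reviews(existing, fetched, id_field="review_id"):
    """Return (merged, new) where `new` holds only reviews not seen before."""
    seen = {review[id_field] for review in existing}
    new = []
    for review in fetched:
        if review[id_field] not in seen:
            seen.add(review[id_field])
            new.append(review)
    return existing + new, new


if __name__ == "__main__":
    stored = [{"review_id": 1, "rating": 5}]
    latest = [{"review_id": 1, "rating": 5}, {"review_id": 2, "rating": 3}]
    merged, new = merge_new_reviews(stored, latest)
    print(len(merged), len(new))
```

Returning the new reviews separately lets the scheduler recalculate scores only for products that actually received new reviews.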

Gap 6: User Interface

The Problem

The outputs are JSON files, which are not user-friendly for exploration.

What We Need

Searchable product database
Filters by category, price, skin type
Comparison views
Explanation tooltips

Next Step

Build a simple web app or Notion database for exploration.

Priority Order

Gap	Impact	Effort	Priority
Ingredients	High	Medium	1st
Pricing	High	Low	2nd
Categories	Medium	Low	3rd
Ingredient Intelligence	Very High	High	4th
Real-Time	Medium	Medium	5th
User Interface	Medium	Medium	6th

✅ Good News

The core pipeline is done. These gaps are enhancements, not blockers. The Love Score already works without them.

What's Next?

If you're unfamiliar with any terms used in this documentation, check the glossary.

Next: Glossary → Plain-English definitions.
