What's Missing

The pipeline is functional but incomplete. Here are the gaps and next steps.


Gap 1: No Ingredients Data

The Problem

We know WHAT products are loved, but not WHY from a formulation perspective. The review data doesn't include ingredient lists.

What We Need

Scrape Sephora product pages to get:

  • Full ingredients list
  • Product description
  • How-to-use instructions
  • Size/volume

Next Step

Use the products_for_details.jsonl output (a tiered list) to prioritize scraping, starting with the high-priority products.
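As a rough illustration of tier-based prioritization, the sketch below reads JSONL records and orders them by tier. The field names product_id and tier are assumptions made for this example, as is the convention that tier 1 means highest priority; the real schema of products_for_details.jsonl may differ.

```python
import json
from io import StringIO

# Hypothetical records: the real products_for_details.jsonl schema is not shown here.
SAMPLE = """\
{"product_id": "P1", "tier": 2}
{"product_id": "P2", "tier": 1}
{"product_id": "P3", "tier": 3}
{"product_id": "P4", "tier": 1}
"""

def load_scrape_queue(lines):
    """Parse JSONL lines and return records ordered by tier (assumed: 1 = highest priority)."""
    records = [json.loads(line) for line in lines if line.strip()]
    # sorted() is stable, so products within the same tier keep their file order.
    return sorted(records, key=lambda r: r["tier"])

queue = load_scrape_queue(StringIO(SAMPLE))
print([r["product_id"] for r in queue])
```

In practice the same function could be given an open file handle instead of the in-memory sample.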

🔍 The End Goal

Once we have ingredients, we can correlate them with review sentiment. Which ingredients appear in loved products? Which correlate with negative reviews?
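One simple way to start on that question is to compare the average score of products containing an ingredient against the overall average. The sketch below uses made-up products and a hypothetical love_score field; it shows association only, and a real analysis would need far more products and controls for category and price.

```python
from collections import defaultdict
from statistics import mean

# Made-up products for illustration only; love_score and ingredients are assumed field names.
products = [
    {"name": "A", "love_score": 90, "ingredients": {"glycerin", "niacinamide"}},
    {"name": "B", "love_score": 80, "ingredients": {"glycerin", "fragrance"}},
    {"name": "C", "love_score": 40, "ingredients": {"fragrance", "alcohol denat."}},
]

def ingredient_lift(products):
    """Return {ingredient: mean score of products containing it minus the overall mean}."""
    overall = mean(p["love_score"] for p in products)
    scores = defaultdict(list)
    for p in products:
        for ingredient in p["ingredients"]:
            scores[ingredient].append(p["love_score"])
    return {ing: mean(vals) - overall for ing, vals in scores.items()}

lift = ingredient_lift(products)
# Positive values: the ingredient shows up in above-average products.
for ingredient, delta in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ingredient}: {delta:+.1f}")
```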


Gap 2: No Pricing Information

The Problem

We can't calculate value-for-money without knowing prices. A $200 moisturizer might have the same Love Score as a $30 one, and without price data we can't tell which is the better buy.

What We Need

  • Current prices
  • Price history (if available)
  • Size information for price-per-ounce

Next Step

Add price extraction to the product page scraper. Sephora shows the price on every product page.
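Once price and size are both available, price-per-ounce is a small calculation. The sketch below assumes the size arrives as free text such as "1.7 oz" or "50 mL"; the function names are hypothetical, and treating every "oz" as a fluid ounce is a simplification (some products are sold by weight).

```python
import re

ML_PER_FL_OZ = 29.5735  # millilitres in one US fluid ounce

def parse_size_oz(size_text):
    """Return the first size found in the text, in ounces, or None if no size is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(fl\.?\s*oz|oz|ml)\b", size_text, re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Convert millilitres to ounces; ounce values pass through unchanged.
    return value / ML_PER_FL_OZ if unit == "ml" else value

def price_per_oz(price, size_text):
    """Return price divided by size in ounces, or None when the size is missing or zero."""
    ounces = parse_size_oz(size_text)
    if not ounces:
        return None
    return price / ounces

print(price_per_oz(200.0, "1.7 oz"))
print(price_per_oz(30.0, "50 mL"))
```

Returning None for unparseable sizes keeps those products out of value comparisons rather than guessing a size.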


Gap 3: No Category Hierarchy

The Problem

Products aren't fully categorized. We know a product exists, but not its place in the taxonomy.

Example: Is this a...

  • Skincare > Moisturizers > Day Creams
  • Or Skincare > Treatments > Serums?

What We Need

Full category path from Sephora's navigation structure.

Next Step

Extract breadcrumbs from product pages during detail scraping.
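How the breadcrumb is pulled out of the page depends on Sephora's markup, which is not covered here. Assuming the scraper ends up with a breadcrumb string in the "A > B > C" form used above, a minimal sketch for turning it into a structured category path could look like this (the function name and output keys are hypothetical):

```python
def parse_category_path(path, sep=">"):
    """Split a breadcrumb string into ordered category levels."""
    levels = [part.strip() for part in path.split(sep) if part.strip()]
    return {
        "path": levels,
        "top_level": levels[0] if levels else None,
        "leaf": levels[-1] if levels else None,
        "depth": len(levels),
    }

category = parse_category_path("Skincare > Moisturizers > Day Creams")
print(category["path"])
print(category["leaf"])
```

Storing the full path as a list makes it possible to filter at any level of the taxonomy later.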


Gap 4: Ingredient Intelligence

The Problem

Even once we have ingredient lists, they're just text. We need to understand them.

What We Need

  1. Parse ingredient lists into structured data
    • Separate each ingredient
    • Normalize names (Vitamin C = Ascorbic Acid)
  2. Identify ingredient types
    • Active ingredients vs. fillers
    • Preservatives, fragrances, emulsifiers
  3. Cluster by formulation
    • Group similar products
    • Identify formulation patterns
  4. Correlate with sentiment
    • Which ingredients appear in loved products?
    • Which trigger negative reviews?

Next Step

This is Phase 2 of the project. It requires:

  • NER model for ingredient extraction
  • Ingredient database for classification
  • Analysis pipeline for correlations
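Before an NER model is in place, the first step (separating and normalizing ingredients) can be approximated with plain string rules. The sketch below is a rule-based stand-in, not the NER approach listed above; the alias table is a tiny hypothetical sample, and a real one would come from the ingredient database.

```python
import re

# Hypothetical alias table: maps common names to one canonical name.
ALIASES = {
    "vitamin c": "ascorbic acid",
    "aqua": "water",
}

def parse_ingredients(raw):
    """Split a raw ingredient string into normalized, de-duplicated names (order preserved)."""
    # Split on a comma followed by whitespace so names like "1,2-Hexanediol" stay whole.
    parts = re.split(r",\s+", raw.strip().rstrip("."))
    normalized = []
    for part in parts:
        name = re.sub(r"\s+", " ", part).strip().lower()
        if not name:
            continue
        name = ALIASES.get(name, name)
        if name not in normalized:
            normalized.append(name)
    return normalized

print(parse_ingredients("Aqua, Glycerin, Vitamin C, 1,2-Hexanediol, Ascorbic Acid."))
```

Order is preserved because ingredient lists are typically ordered by concentration, which may matter for the later clustering step.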

Gap 5: Real-Time Updates

The Problem

The data is a snapshot, but new reviews are posted daily, so scores gradually go out of date.

What We Need

  • Incremental scraping (only new reviews)
  • Score recalculation on schedule
  • Trend detection for rising/falling products

Next Step

Build a scheduler that:

  1. Runs collection weekly
  2. Appends new reviews to tables
  3. Recalculates scores
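Steps 2 and 3 can be sketched as follows, leaving the weekly trigger to whatever scheduler is chosen. The field names review_id, product_id, and rating are assumptions, and the scoring function is a placeholder (mean rating per product) rather than the actual Love Score formula.

```python
from statistics import mean

def append_new_reviews(existing, incoming):
    """Append only reviews whose review_id is not already stored; return how many were added."""
    seen = {r["review_id"] for r in existing}
    added = 0
    for review in incoming:
        if review["review_id"] not in seen:
            existing.append(review)
            seen.add(review["review_id"])
            added += 1
    return added

def recalculate_scores(reviews):
    """Placeholder score: mean rating per product. Swap in the real Love Score here."""
    ratings_by_product = {}
    for r in reviews:
        ratings_by_product.setdefault(r["product_id"], []).append(r["rating"])
    return {pid: mean(ratings) for pid, ratings in ratings_by_product.items()}

# Hypothetical data: one stored review, then a weekly batch containing a duplicate.
table = [{"review_id": "r1", "product_id": "P1", "rating": 5}]
batch = [
    {"review_id": "r1", "product_id": "P1", "rating": 5},
    {"review_id": "r2", "product_id": "P1", "rating": 3},
]
added = append_new_reviews(table, batch)
scores = recalculate_scores(table)
print(added, scores)
```

De-duplicating on a review identifier makes the weekly run safe to repeat, since re-collecting an overlapping window does not double-count reviews.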

Gap 6: User Interface

The Problem

The outputs are JSON files, which are not user-friendly for exploration.

What We Need

  • Searchable product database
  • Filters by category, price, skin type
  • Comparison views
  • Explanation tooltips

Next Step

Build a simple web app or Notion database for exploration.


Priority Order

Gap                      Impact     Effort  Priority
Ingredients              High       Medium  1st
Pricing                  High       Low     2nd
Categories               Medium     Low     3rd
Ingredient Intelligence  Very High  High    4th
Real-Time                Medium     Medium  5th
User Interface           Medium     Medium  6th

Good News

The core pipeline is done. These gaps are enhancements, not blockers. The Love Score already works without them.


What's Next?

If you're unfamiliar with any terms used in this documentation, check the glossary.

Next: Glossary → Plain-English definitions.