
Stage 1: Data Collection

The first step is getting the data. We connect to Sephora's review infrastructure and download everything.

What Is BazaarVoice?

Sephora doesn't build their own review system — they use a service called BazaarVoice. This is the same platform used by many major retailers. It provides:

  • Review submission forms
  • Moderation workflows
  • An API for accessing reviews

We tap into this API to collect reviews in bulk.

What We Collect

Every review contains rich information:

| Field | Description | Why It Matters |
|---|---|---|
| Star Rating | 1-5 stars | The obvious signal (but easily gamed) |
| Review Text | The actual written review | Reveals the why behind the rating |
| Title | Headline of the review | Quick summary of sentiment |
| Would Recommend | Yes/No | Often more honest than the star rating |
| Helpful Votes | How many found it useful | Crowd validation |
| Photos | Images attached | Visual proof of actual use |
| Skin Type | Oily, Dry, Combo, Normal | Does it work for people like you? |
| Skin Tone | Fair to Deep scale | Makeup shade relevance |
| Age Range | 18-24, 25-34, etc. | Demographic context |
| Incentivized | Got free product? | Critical for authenticity |
| Staff | Sephora employee? | Bias indicator |
| Date | When posted | Recency matters |
🔍 The Incentivized Flag

This is gold. Sephora explicitly marks reviews where the person received free product. This becomes our ground truth for training models — incentivized reviews tend to be more positive than organic ones.

The Collection Process

Step 1: Get Product List

We start with Sephora's sitemap, a list of every product URL. This gives us 10,000+ products to process.
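As a rough illustration, here is what that sitemap pass could look like in Python. The sitemap URL and the assumption that product pages contain `/product/` in their path are placeholders of ours, not anything Sephora documents; treat this as a sketch of the idea rather than the exact scraper.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical sitemap URL; the real index may be split across several files.
SITEMAP_URL = "https://www.sephora.com/sitemap/products-sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def get_product_urls(sitemap_url: str = SITEMAP_URL) -> list[str]:
    """Download the sitemap and keep only product-page URLs."""
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
    # Assumption: product pages have "/product/" in their path.
    return [u for u in urls if u and "/product/" in u]

if __name__ == "__main__":
    products = get_product_urls()
    print(f"Found {len(products)} product URLs")
```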

Step 2: Request Reviews per Product

For each product, we call the BazaarVoice API to get all its reviews. Reviews come paginated (100 at a time), so a product with 1,500 reviews needs 15 API calls.
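A minimal sketch of that per-product loop, assuming an offset-based reviews endpoint. The endpoint, `apiversion`, passkey, and the `Results`/`TotalResults` field names follow BazaarVoice's public Conversations API conventions, but treat them all as placeholders here; the passkey in particular is a stand-in.

```python
import requests

API_URL = "https://api.bazaarvoice.com/data/reviews.json"  # placeholder endpoint
PASSKEY = "YOUR_PASSKEY"  # hypothetical; in practice pulled from Sephora's frontend
PAGE_SIZE = 100

def fetch_reviews(product_id: str) -> list[dict]:
    """Page through all reviews for one product, 100 at a time."""
    reviews, offset, total = [], 0, None
    while total is None or offset < total:
        params = {
            "apiversion": "5.4",
            "passkey": PASSKEY,
            "Filter": f"ProductId:{product_id}",
            "Limit": PAGE_SIZE,
            "Offset": offset,
        }
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        total = payload.get("TotalResults", 0)
        reviews.extend(payload.get("Results", []))
        offset += PAGE_SIZE
    return reviews
```

A product with 1,500 reviews makes 15 passes through this loop, one per page of 100.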

Step 3: Handle Rate Limits

We can only make about one request every 2 seconds. Going faster triggers 403 errors. So we do three things (sketched in code after this list):

  • Add delays between requests
  • Implement exponential backoff for retries
  • Save progress frequently
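Here is a sketch of the throttle-and-retry wrapper, assuming the 2-second budget and 403-on-overload behavior described above; the retry count and backoff schedule are illustrative defaults, not the exact values we used.

```python
import time
import requests

MIN_INTERVAL = 2.0   # seconds between requests (from the observed rate limit)
MAX_RETRIES = 5
_last_request = 0.0

def polite_get(url: str, **kwargs) -> requests.Response:
    """GET with a fixed delay between calls and exponential backoff on 403/5xx."""
    global _last_request
    for attempt in range(MAX_RETRIES):
        # Throttle: never fire faster than one request per MIN_INTERVAL seconds.
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()

        resp = requests.get(url, timeout=30, **kwargs)
        if resp.status_code not in (403, 429, 500, 502, 503):
            return resp
        # Back off: 2s, 4s, 8s, ... before retrying.
        time.sleep(2 ** (attempt + 1))
    resp.raise_for_status()  # give up and surface the last error
    return resp
```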

Step 4: Save Raw Data

Reviews are saved as compressed JSONL files (one JSON object per line, gzipped); a short writer sketch follows the list. This format is:

  • Easy to process line-by-line
  • Compresses well (text data)
  • Appendable (can resume without rewriting)
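A sketch of an append-friendly writer using Python's standard `gzip` and `json` modules. Appending in mode `"at"` adds a new gzip member to the file, and gzip readers treat concatenated members as one continuous stream, which is what makes resuming cheap.

```python
import gzip
import json

def append_reviews(path: str, reviews: list[dict]) -> None:
    """Append one JSON object per line to a gzipped JSONL file."""
    with gzip.open(path, "at", encoding="utf-8") as f:
        for review in reviews:
            f.write(json.dumps(review, ensure_ascii=False) + "\n")

def read_reviews(path: str):
    """Stream reviews back out, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```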

Challenges We Overcame

403 Errors

The API sometimes blocks requests, especially when we're too aggressive. On the first run, we lost about 100,000 reviews to 403s.

Solution: We re-ran the scraper with better retry logic and recovered them.

Checkpointing

If the scraper crashes at 3AM, we don't want to start over from zero.

Solution: Save progress after each product. On restart, skip already-completed products.
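One simple way to implement that checkpoint, assuming completed product IDs fit in a plain text file; the file name and layout here are ours, purely for illustration.

```python
from pathlib import Path

CHECKPOINT = Path("completed_products.txt")  # hypothetical checkpoint file

def load_completed() -> set[str]:
    """Read the set of product IDs that are already fully scraped."""
    if not CHECKPOINT.exists():
        return set()
    return set(CHECKPOINT.read_text().split())

def mark_completed(product_id: str) -> None:
    """Record a finished product immediately, so a crash loses at most one product."""
    with CHECKPOINT.open("a") as f:
        f.write(product_id + "\n")

def scrape_all(product_ids: list[str], scrape_one) -> None:
    """Skip anything already checkpointed; checkpoint after each success."""
    done = load_completed()
    for pid in product_ids:
        if pid in done:
            continue  # finished on a previous run
        scrape_one(pid)  # e.g. fetch reviews and append them, as in the earlier sketches
        mark_completed(pid)
```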

Pagination Edge Cases

Some products have 5,000+ reviews. Getting them all requires careful cursor management.

Solution: Follow the API's pagination cursors exactly and verify that the counts match.
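A sketch of what "follow the cursors, then verify" could look like. The `cursor`/`NextCursor` parameter and field names are placeholders (the real pagination fields depend on the API response); the point is that the loop stops only when the API stops handing out a cursor, and the final count is cross-checked against the reported total.

```python
import requests

def fetch_all_with_cursor(base_url: str, params: dict) -> list[dict]:
    """Follow pagination cursors to the end, then verify the count."""
    results, cursor = [], None
    while True:
        page_params = dict(params)
        if cursor:
            page_params["cursor"] = cursor   # placeholder parameter name
        resp = requests.get(base_url, params=page_params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        results.extend(payload.get("Results", []))
        cursor = payload.get("NextCursor")   # placeholder field name
        if not cursor:
            break
    expected = payload.get("TotalResults", len(results))
    if len(results) != expected:
        raise RuntimeError(f"Expected {expected} reviews, got {len(results)}")
    return results
```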

The Numbers

| Metric | Value |
|---|---|
| Total products scraped | 10,000+ |
| Raw review records | 5.5 million |
| Time span | 2008-2025 (17 years) |
| Compressed data size | ~2 GB |
⚠️ Duplicates Ahead

Raw data has duplicates — reviews updated over time, re-scraped products, etc. The next stage (Cleaning) handles deduplication.


What's Next?

The raw data is messy. Reviews are scattered across files, duplicates exist, and the JSON structure is complex.

Next: Data Cleaning → How we organize this into clean tables.