Stage 1: Data Collection
The first step is getting the data. We connect to Sephora's review infrastructure and download every review for every product.
What Is BazaarVoice?
Sephora doesn't build their own review system — they use a service called BazaarVoice. This is the same platform used by many major retailers. It provides:
- Review submission forms
- Moderation workflows
- An API for accessing reviews
We tap into this API to collect reviews in bulk.
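As a sketch, a review request against a BazaarVoice-style endpoint can be assembled like this. The endpoint path and parameter names follow BazaarVoice's public Conversations API conventions, but the passkey, API version, and product ID below are placeholder assumptions, not Sephora's actual values:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and credentials for illustration only.
BASE = "https://api.bazaarvoice.com/data/reviews.json"

def build_review_url(passkey: str, product_id: str,
                     limit: int = 100, offset: int = 0) -> str:
    """Build a review-listing URL for one product, one page at a time."""
    params = {
        "apiversion": "5.4",          # assumed version string
        "passkey": passkey,           # site-specific API key
        "Filter": f"ProductId:{product_id}",
        "Limit": limit,               # reviews per page
        "Offset": offset,             # where this page starts
    }
    return f"{BASE}?{urlencode(params)}"

url = build_review_url("EXAMPLE_PASSKEY", "P12345")
```

The same URL builder is reused for every page of every product; only `product_id` and `offset` change between calls.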
What We Collect
Every review contains rich information:
| Field | Description | Why It Matters |
|---|---|---|
| Star Rating | 1-5 stars | The obvious signal (but easily gamed) |
| Review Text | The actual written review | Reveals the why behind the rating |
| Title | Headline of the review | Quick summary of sentiment |
| Would Recommend | Yes/No | Often more honest than star rating |
| Helpful Votes | How many found it useful | Crowd validation |
| Photos | Images attached | Visual proof of actual use |
| Skin Type | Oily, Dry, Combo, Normal | Does it work for people like you? |
| Skin Tone | Fair to Deep scale | Makeup shade relevance |
| Age Range | 18-24, 25-34, etc. | Demographic context |
| Incentivized | Got free product? | Critical for authenticity |
| Staff | Sephora employee? | Bias indicator |
| Date | When posted | Recency matters |
This is gold. Sephora explicitly marks reviews where the person received a free product. This becomes our ground truth for training models — incentivized reviews tend to be more positive than organic ones.
The Collection Process
Step 1: Get Product List
We start with Sephora's sitemap — a list of every product URL. This gives us roughly 10,000 products to process.
Step 2: Request Reviews per Product
For each product, we call the BazaarVoice API to get all its reviews. Reviews come paginated (100 at a time), so a product with 1,500 reviews needs 15 API calls.
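The paging arithmetic is simple enough to sketch as a small helper (illustrative only, not the actual scraper code):

```python
import math

def page_offsets(total_reviews: int, page_size: int = 100) -> list:
    """Offsets needed to page through all of a product's reviews,
    page_size reviews per API call."""
    pages = math.ceil(total_reviews / page_size)
    return [i * page_size for i in range(pages)]

# A product with 1,500 reviews needs 15 calls, at offsets 0, 100, ..., 1400.
offsets = page_offsets(1500)
```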
Step 3: Handle Rate Limits
We can only make about 1 request every 2 seconds. Going faster triggers 403 errors. So we:
- Add delays between requests
- Implement exponential backoff for retries
- Save progress frequently
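The delay-and-retry tactics above can be sketched as a wrapper like this. The names are illustrative; `fetch` stands in for whatever performs the HTTP call and returns a status code plus body:

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5,
                       base_delay=2.0, sleep=time.sleep):
    """Retry on 403 with exponential backoff: 2s, 4s, 8s, ..."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 403:
            return body
        # Blocked: wait progressively longer before retrying.
        sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"still blocked after {max_retries} retries: {url}")
```

Injecting `sleep` as a parameter keeps the wrapper testable without real waiting.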
Step 4: Save Raw Data
Reviews are saved as compressed JSONL files (one JSON object per line, gzipped). This format is:
- Easy to process line-by-line
- Compresses well (text data)
- Appendable (can resume without rewriting)
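A minimal sketch of writing and streaming back such a file, using only Python's standard `gzip` and `json` modules (file layout and helper names are assumptions):

```python
import gzip
import json

def append_reviews(path: str, reviews: list) -> None:
    """Append one JSON object per line to a gzipped JSONL file.
    Appending creates a new gzip member; readers handle this fine."""
    with gzip.open(path, "at", encoding="utf-8") as f:
        for r in reviews:
            f.write(json.dumps(r) + "\n")

def read_reviews(path: str):
    """Stream reviews back line-by-line without loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```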
Challenges We Overcame
403 Errors
The API sometimes blocks requests, especially when we're too aggressive. On the first run, we lost about 100,000 reviews to 403s.
Solution: We re-ran the scraper with better retry logic and recovered them.
Checkpointing
If the scraper crashes at 3AM, we don't want to start over from zero.
Solution: Save progress after each product. On restart, skip already-completed products.
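A minimal checkpointing sketch (the checkpoint filename and helper names here are hypothetical):

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical filename

def load_done(path: str = CHECKPOINT) -> set:
    """Load the set of product IDs already scraped, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def mark_done(product_id: str, done: set, path: str = CHECKPOINT) -> None:
    """Record one finished product and persist immediately,
    so a 3AM crash loses at most the product in flight."""
    done.add(product_id)
    with open(path, "w") as f:
        json.dump(sorted(done), f)
```

On restart, the scraper loads the set once and skips any product ID it already contains.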
Pagination Edge Cases
Some products have 5,000+ reviews. Getting them all requires careful cursor management.
Solution: Follow the API's pagination cursors exactly, verify counts match.
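One way to sketch that loop-plus-verification, assuming each response reports a grand total (as a field like BazaarVoice's `TotalResults` does):

```python
def fetch_all(fetch_page, page_size: int = 100) -> list:
    """Page through all reviews via offsets, then verify the
    collected count matches the total the API reported."""
    reviews = []
    offset = 0
    while True:
        # fetch_page returns (list_of_reviews, reported_total)
        batch, total = fetch_page(offset)
        reviews.extend(batch)
        offset += page_size
        if offset >= total or not batch:
            break
    if len(reviews) != total:
        raise ValueError(f"count mismatch: got {len(reviews)}, API says {total}")
    return reviews
```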
The Numbers
| Metric | Value |
|---|---|
| Total products scraped | ~10,000 |
| Raw review records | 5.5 million |
| Time span | 2008 - 2025 (17 years) |
| Compressed data size | ~2 GB |
Raw data has duplicates — reviews updated over time, re-scraped products, etc. The next stage (Cleaning) handles deduplication.
What's Next?
The raw data is messy. Reviews are scattered across files, duplicates exist, and the JSON structure is complex.
Next: Data Cleaning → how we organize this into clean tables.