
Stage 1: Data Collection

The first step is getting the data. We connect to Sephora's review infrastructure and download everything.

What Is BazaarVoice?

Sephora doesn't build their own review system — they use a service called BazaarVoice. This is the same platform used by many major retailers. It provides:

  • Review submission forms
  • Moderation workflows
  • An API for accessing reviews

We tap into this API to collect reviews in bulk.

What We Collect

Every review contains rich information:

| Field | Description | Why It Matters |
|---|---|---|
| Star Rating | 1-5 stars | The obvious signal (but easily gamed) |
| Review Text | The actual written review | Reveals the why behind the rating |
| Title | Headline of the review | Quick summary of sentiment |
| Would Recommend | Yes/No | Often more honest than the star rating |
| Helpful Votes | How many found it useful | Crowd validation |
| Photos | Images attached | Visual proof of actual use |
| Skin Type | Oily, Dry, Combo, Normal | Does it work for people like you? |
| Skin Tone | Fair to Deep scale | Makeup shade relevance |
| Age Range | 18-24, 25-34, etc. | Demographic context |
| Incentivized | Got free product? | Critical for authenticity |
| Staff | Sephora employee? | Bias indicator |
| Date | When posted | Recency matters |
🔍 The Incentivized Flag

This is gold. Sephora explicitly marks reviews where the person received free product. This becomes our ground truth for training models — incentivized reviews tend to be more positive than organic ones.

The Collection Process

Step 1: Get Product List

We start with Sephora's sitemap, a list of every product URL. This gives us 10,000+ products to process.
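As a rough illustration, here is what that sitemap pass could look like in Python. The sitemap URL and the assumption that product pages contain `/product/` in their path are placeholders of ours, not anything Sephora documents; treat this as a sketch of the idea rather than the exact scraper.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical sitemap URL; the real index may be split across several files.
SITEMAP_URL = "https://www.sephora.com/sitemap/products-sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def get_product_urls(sitemap_url: str = SITEMAP_URL) -> list[str]:
    """Download the sitemap and keep only product-page URLs."""
    resp = requests.get(sitemap_url, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
    # Assumption: product pages have "/product/" in their path.
    return [u for u in urls if u and "/product/" in u]

if __name__ == "__main__":
    products = get_product_urls()
    print(f"Found {len(products)} product URLs")
```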

Step 2: Request Reviews per Product

For each product, we call the BazaarVoice API to get all its reviews. Reviews come paginated (100 at a time), so a product with 1,500 reviews needs 15 API calls.
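A minimal sketch of that per-product loop, assuming an offset-based reviews endpoint. The endpoint, `apiversion`, passkey, and the `Results`/`TotalResults` field names follow BazaarVoice's public Conversations API conventions, but treat them all as placeholders here; the passkey in particular is a stand-in.

```python
import requests

API_URL = "https://api.bazaarvoice.com/data/reviews.json"  # placeholder endpoint
PASSKEY = "YOUR_PASSKEY"  # hypothetical; in practice pulled from Sephora's frontend
PAGE_SIZE = 100

def fetch_reviews(product_id: str) -> list[dict]:
    """Page through all reviews for one product, 100 at a time."""
    reviews, offset, total = [], 0, None
    while total is None or offset < total:
        params = {
            "apiversion": "5.4",
            "passkey": PASSKEY,
            "Filter": f"ProductId:{product_id}",
            "Limit": PAGE_SIZE,
            "Offset": offset,
        }
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        total = payload.get("TotalResults", 0)
        reviews.extend(payload.get("Results", []))
        offset += PAGE_SIZE
    return reviews
```

A product with 1,500 reviews makes 15 passes through this loop, one per page of 100.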

Step 3: Handle Rate Limits

We can only make about one request every 2 seconds. Going faster triggers 403 errors. So we do three things (sketched in code after this list):

  • Add delays between requests
  • Implement exponential backoff for retries
  • Save progress frequently
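Here is a sketch of the throttle-and-retry wrapper, assuming the 2-second budget and 403-on-overload behavior described above; the retry count and backoff schedule are illustrative defaults, not the exact values we used.

```python
import time
import requests

MIN_INTERVAL = 2.0   # seconds between requests (from the observed rate limit)
MAX_RETRIES = 5
_last_request = 0.0

def polite_get(url: str, **kwargs) -> requests.Response:
    """GET with a fixed delay between calls and exponential backoff on 403/5xx."""
    global _last_request
    for attempt in range(MAX_RETRIES):
        # Throttle: never fire faster than one request per MIN_INTERVAL seconds.
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()

        resp = requests.get(url, timeout=30, **kwargs)
        if resp.status_code not in (403, 429, 500, 502, 503):
            return resp
        # Back off: 2s, 4s, 8s, ... before retrying.
        time.sleep(2 ** (attempt + 1))
    resp.raise_for_status()  # give up and surface the last error
    return resp
```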

Step 4: Save Raw Data

Reviews are saved as compressed JSONL files (one JSON object per line, gzipped); a short writer sketch follows the list. This format is:

  • Easy to process line-by-line
  • Compresses well (text data)
  • Appendable (can resume without rewriting)
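A sketch of an append-friendly writer using Python's standard `gzip` and `json` modules. Appending in mode `"at"` adds a new gzip member to the file, and gzip readers treat concatenated members as one continuous stream, which is what makes resuming cheap.

```python
import gzip
import json

def append_reviews(path: str, reviews: list[dict]) -> None:
    """Append one JSON object per line to a gzipped JSONL file."""
    with gzip.open(path, "at", encoding="utf-8") as f:
        for review in reviews:
            f.write(json.dumps(review, ensure_ascii=False) + "\n")

def read_reviews(path: str):
    """Stream reviews back out, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```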

Challenges We Overcame

403 Errors

The API sometimes blocks requests, especially when we're too aggressive. On the first run, we lost about 100,000 reviews to 403s.

Solution: We re-ran the scraper with better retry logic and recovered them.

Checkpointing

If the scraper crashes at 3AM, we don't want to start over from zero.

Solution: Save progress after each product. On restart, skip already-completed products.
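One simple way to implement that checkpoint, assuming completed product IDs fit in a plain text file; the file name and layout here are ours, purely for illustration.

```python
from pathlib import Path

CHECKPOINT = Path("completed_products.txt")  # hypothetical checkpoint file

def load_completed() -> set[str]:
    """Read the set of product IDs that are already fully scraped."""
    if not CHECKPOINT.exists():
        return set()
    return set(CHECKPOINT.read_text().split())

def mark_completed(product_id: str) -> None:
    """Record a finished product immediately, so a crash loses at most one product."""
    with CHECKPOINT.open("a") as f:
        f.write(product_id + "\n")

def scrape_all(product_ids: list[str], scrape_one) -> None:
    """Skip anything already checkpointed; checkpoint after each success."""
    done = load_completed()
    for pid in product_ids:
        if pid in done:
            continue  # finished on a previous run
        scrape_one(pid)  # e.g. fetch reviews and append them, as in the earlier sketches
        mark_completed(pid)
```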

Pagination Edge Cases

Some products have 5,000+ reviews. Getting them all requires careful cursor management.

Solution: Follow the API's pagination cursors exactly and verify that the counts match.
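A sketch of what "follow the cursors, then verify" could look like. The `cursor`/`NextCursor` parameter and field names are placeholders (the real pagination fields depend on the API response); the point is that the loop stops only when the API stops handing out a cursor, and the final count is cross-checked against the reported total.

```python
import requests

def fetch_all_with_cursor(base_url: str, params: dict) -> list[dict]:
    """Follow pagination cursors to the end, then verify the count."""
    results, cursor = [], None
    while True:
        page_params = dict(params)
        if cursor:
            page_params["cursor"] = cursor   # placeholder parameter name
        resp = requests.get(base_url, params=page_params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        results.extend(payload.get("Results", []))
        cursor = payload.get("NextCursor")   # placeholder field name
        if not cursor:
            break
    expected = payload.get("TotalResults", len(results))
    if len(results) != expected:
        raise RuntimeError(f"Expected {expected} reviews, got {len(results)}")
    return results
```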

The Numbers

| Metric | Value |
|---|---|
| Total products scraped | 10,000+ |
| Raw review records | 5.5 million |
| Time span | 2008-2025 (17 years) |
| Compressed data size | ~2 GB |
⚠️ Duplicates Ahead

Raw data has duplicates — reviews updated over time, re-scraped products, etc. The next stage (Cleaning) handles deduplication.


What's Next?

The raw data is messy. Reviews are scattered across files, duplicates exist, and the JSON structure is complex.

Next: Data Cleaning → How we organize this into clean tables.