How We Analyzed 4.4 Million Sephora Reviews
Finding products people actually love — not just products with good marketing
Introduction
The beauty industry is built on trust. When you're about to spend $50 on a foundation or $120 on a skincare serum, you want to know it actually works. So you do what millions of people do every day: you read the reviews.
But here's the problem: not all reviews are created equal.
A product with 1,000 five-star reviews sounds great. But what if 32% of those reviews came from people who got the product for free? What if incentivized reviewers rate products nearly a full star higher than people who actually paid for them? What if the glowing recommendations you're reading are driven by obligation rather than genuine enthusiasm?
These aren't hypothetical questions. They're exactly what we found when we analyzed 4.4 million Sephora reviews spanning 17 years of customer feedback.
Star ratings are corrupted by incentivized reviews. A 4.5-star product might really be a 3.7-star product when you only count organic buyers. We built a system to find the truth.
This article documents our journey: how we collected the data, what we discovered about review authenticity, and how we built a machine learning pipeline to surface products that are genuinely loved — not just well-marketed.
Along the way, we'll share the technical challenges we faced, the algorithms we developed, and the surprising patterns we found hidden in millions of reviews. Whether you're a data scientist, a beauty enthusiast, or just someone curious about how online reviews actually work, there's something here for you.
Let's start with the fundamental problem.
The Problem: Why Star Ratings Lie
Star ratings seem straightforward. One star is bad, five stars is good, and you should buy products with higher ratings. Simple, right?
Unfortunately, this intuition breaks down completely when you examine how reviews are actually generated. The rating you see is an average — but an average of what?
The Incentivized Review Problem
Brands want good reviews. Good reviews drive sales. So brands give away free products to people who promise to write reviews. This is called incentivized reviewing, and it's everywhere.
On Sephora, incentivized reviews are tagged with disclosures like "I received this product for free" or marked with a campaign ID in the metadata. This transparency is good — it means we can measure exactly how incentivized reviews differ from organic ones.
And the differences are dramatic.
32.1% of all Sephora reviews are incentivized. That's nearly one in three reviews written by someone who didn't pay for the product they're reviewing.
But the percentage alone doesn't tell the full story. What matters is how these reviewers behave differently from organic buyers.
Organic vs Paid: The Data
We compared every metric we could measure between organic reviewers (people who bought the product with their own money) and incentivized reviewers (people who received the product for free or at a discount).
The results are striking:
Let's break down what each of these differences means:
1. Rating Inflation (+0.80 Stars)
Incentivized reviewers rate products 0.80 stars higher on average than organic reviewers. That's a 21% inflation.
Think about what this means in practice: if 32% of a product's reviews are incentivized and those reviewers rate 0.80 stars higher, a displayed 4.5-star average already overstates the organic average by about a quarter star (4.5 − 0.32 × 0.80 ≈ 4.24), and products that lean harder on incentivized programs fall further still. That can be the difference between "excellent" and merely "pretty good."
Why does this happen? Several reasons:
- Reciprocity bias: When someone gives you something for free, you feel obligated to reciprocate. Giving a negative review feels like ingratitude.
- Selection bias: Brands don't randomly give away products. They target people who are already fans of the brand or product category.
- Anchoring: Free product recipients often know what rating the brand is hoping for, even if it's not explicitly stated.
2. The Recommendation Lie (+24.5 Percentage Points)
94.9% of incentivized reviewers recommend the product, compared to only 70.4% of organic reviewers. That's a massive 24.5 percentage point gap.
The "Would you recommend this product?" checkbox is supposed to be a simple yes/no signal of overall satisfaction. But when nearly everyone who gets a free product says "yes," the signal becomes meaningless.
When an incentivized reviewer checks "I recommend this product," there's only a ~70% chance an organic buyer would say the same. The incentivized recommendation rate (94.9%) is roughly 35% higher than the organic rate (70.4%).
3. Length ≠ Quality (+72% Longer)
Incentivized reviews are 72% longer on average (349 characters vs 202 characters). At first glance, this seems positive — more detailed reviews should be more helpful, right?
But length isn't quality. Incentivized reviewers write longer reviews because they feel obligated to justify receiving a free product. They pad their reviews with generic phrases, extensive product descriptions (often copied from the product page), and effusive praise that organic reviewers don't bother with.
In fact, we found that review length is actually a suspicion signal in our fake detection models. Unusually long reviews are more likely to be incentivized.
4. The Helpful Votes Gap (8.3× Difference)
This is the most telling metric of all. Organic reviews receive 8.3 times more "helpful" votes than incentivized reviews.
The Sephora community has an implicit understanding of which reviews are actually useful. When thousands of shoppers vote on whether a review helped them make a purchasing decision, they consistently choose organic reviews over incentivized ones.
Before our machine learning models even run, the community has already surfaced the authentic reviews. Helpful votes are a powerful signal of review quality.
Detection Patterns
How do we identify incentivized reviews? Sephora provides some explicit labels, but we also developed pattern matching to catch additional cases. Here are some of the disclosure phrases we detect:
Detection patterns (81 total):
"in exchange for my honest review"
"I received this product for free"
"received this sample free"
"free product in exchange"
"complimentary sample"
"provided for free"
"sent to me for review"
"gifted by the brand"
"PR sample"
"courtesy of Sephora"
...and 71 more patterns.

We also use the campaign_id field in the review metadata, which identifies reviews that were part of organized marketing campaigns.
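A minimal sketch of this detection logic, combining a few of the disclosure phrases above with the campaign_id check. The regex forms, function name, and matching rules here are illustrative, not the production patterns:

```python
import re

# Hypothetical subset of the 81 disclosure patterns (illustrative only)
DISCLOSURE_PATTERNS = [
    r"in exchange for (?:my|an) honest review",
    r"received this (?:product|sample) (?:for free|free)",
    r"complimentary sample",
    r"gifted by the brand",
    r"\bPR sample\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in DISCLOSURE_PATTERNS]

def is_incentivized(review_text, campaign_id=None):
    """Flag a review as incentivized if it carries a campaign ID or
    matches any known disclosure phrase."""
    if campaign_id:  # explicit campaign metadata wins
        return True
    return any(p.search(review_text) for p in COMPILED)
```

Explicit metadata is checked first because it is unambiguous; the phrase patterns catch reviewers who disclose in free text only.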
The Dataset
Before we could build a system to find authentic product love, we needed data. Lots of it. We scraped reviews from Sephora's BazaarVoice API over the course of several days, building one of the most comprehensive beauty review datasets ever assembled.
Raw Numbers
Here's what we collected:
- 5.5 million raw records scraped from the API
- 4.4 million unique reviews after deduplication (19.6% were duplicates)
- 1.4 million unique users who wrote those reviews
- 1.1 million photos attached to reviews
- Reviews spanning 2008-2025 — 17 years of beauty product feedback
What Each Review Contains
Each review in our dataset includes:
- Rating: 1-5 stars
- Review text: The actual written review (98% contain text, not just stars)
- Title: Optional headline for the review
- Recommendation: "Would you recommend this product?" (yes/no)
- Helpful votes: How many people found the review helpful
- Unhelpful votes: How many people found it unhelpful
- Photos: Images attached to the review
- Timestamp: When the review was submitted
- User demographics: Skin type, skin tone, eye color, hair color, age range
- User metadata: Is the reviewer a Sephora employee? Is this an incentivized review?
- Campaign ID: If the review was part of a marketing campaign
- Source: Desktop, mobile web, iOS app, Android app
Data Quality & Completeness
Not every review has complete information. Here's how complete our demographic data is:
A few notes on data completeness:
- Skin Type and Skin Tone have excellent coverage (75-83%), making them reliable for diversity analysis.
- Eye Color and Hair Color are moderately complete (65-70%).
- Age Range is sparsely populated (15%), so we don't weight by age in our scoring algorithm.
The high completeness of skin type and skin tone data is particularly valuable for beauty products. A foundation that works across all skin tones is genuinely more inclusive than one that only works for a narrow range — and we can measure this.
The Pipeline
Our system transforms raw reviews into actionable product rankings through four distinct stages. Each stage builds on the previous one, progressively enriching the data with intelligence.
The diagram above shows the complete architecture. Data flows from left to right: scraped from the API, processed and stored, enriched with ML, and output as actionable rankings.
Let's dive deep into each stage.
Stage 1: Collection
The first challenge was getting the data. Sephora's reviews are powered by BazaarVoice, a third-party review platform. BazaarVoice provides an API, but it's heavily rate-limited to prevent abuse.
Rate Limiting Challenges
The API allows approximately 1 request every 2 seconds. Any faster, and you start getting 429 (Too Many Requests) or 403 (Forbidden) errors.
On our first scraping attempt, we didn't respect these limits carefully enough. The result? We lost over 100,000 reviews to 403 errors before we realized what was happening.
Exponential Backoff Algorithm
We implemented an exponential backoff algorithm that automatically adjusts our request rate based on the responses we receive:
```python
import random

def calculate_wait_time(error_code, retry_count):
    """Calculate wait time (in seconds) based on error type and retry count."""
    if error_code == 429:  # Too Many Requests
        base_wait = 5
        multiplier = 2 ** retry_count  # 1, 2, 4, 8... doubling per retry
        jitter = random.uniform(1, 5)
        return base_wait * multiplier + jitter
    elif error_code >= 500:  # Server errors
        base_wait = 3
        multiplier = 2 ** retry_count
        jitter = random.uniform(1, 3)
        return base_wait * multiplier + jitter
    elif error_code == 408:  # Timeout
        base_wait = 2
        multiplier = 2 ** retry_count
        jitter = random.uniform(1, 3)
        return base_wait * multiplier + jitter
    return 2.0  # Default wait

# Configuration
MAX_RETRIES = 3           # Per product
CHECKPOINT_INTERVAL = 50  # Save progress every 50 products
```

Scraping Modes
We developed three scraping modes to balance speed against reliability:
| Mode | Delay | Timeout | Workers | Use Case |
|---|---|---|---|---|
| Standard | 2.0s | 45s | 1 | Safe, reliable scraping |
| Fast-Safe | 0.3-0.8s | 15s | 8 | Parallel with smart throttling |
| Ultra | 1-2s | 20s | 4 | Maximum throughput |
Checkpointing
Scraping millions of reviews takes days. If the process crashes (server error, network issue, power outage), you don't want to start over from scratch.
We checkpoint our progress every 50 products, saving:
- Which products have been fully scraped
- Which products had errors (for retry)
- Current pagination state for each product
- Running statistics (review count, error rate)
This allows us to resume from exactly where we left off if anything goes wrong.
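A sketch of this checkpointing scheme, assuming a simple JSON state file; the field names are illustrative. The write-to-temp-then-rename pattern ensures a crash mid-save never leaves a corrupt checkpoint:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Atomically persist scraper progress: write to a temp file in the
    same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"done": [], "errors": [], "page_state": {}, "stats": {"reviews": 0}}
```

On restart, the scraper skips everything in `done`, retries everything in `errors`, and resumes pagination from `page_state`.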
Stage 2: Cleaning & Normalization
Raw data is messy. The cleaning stage transforms our 5.5 million raw records into a clean, normalized database optimized for analysis.
Deduplication
19.6% of raw records were duplicates. These appeared for several reasons:
- The same review appearing on multiple product pages (product variants)
- API pagination issues returning the same reviews twice
- Reviews edited by users (both versions returned)
We deduplicated using the unique review_id field, keeping the most recent version of each review.
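The deduplication rule can be sketched in a few lines, assuming each record carries a `review_id` and an ISO-format `submitted_at` timestamp (which sorts correctly as a string):

```python
def deduplicate(reviews):
    """Keep only the most recently submitted version of each review_id."""
    latest = {}
    for r in reviews:
        rid = r["review_id"]
        # ISO timestamps compare correctly as plain strings
        if rid not in latest or r["submitted_at"] > latest[rid]["submitted_at"]:
            latest[rid] = r
    return list(latest.values())
```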
Star Schema Design
We normalized the data into a star schema with 6 tables. This design optimizes for analytical queries — you only read the data you need for each analysis.
Why This Design?
- Faster queries: Join only the tables you need. Analyzing ratings? You don't need to read photo data.
- Less redundancy: User information is stored once, not repeated in every review.
- Easier updates: If a user's profile changes, update one row instead of thousands.
- Analytical power: Query patterns become visible. Easy to aggregate by user, by product, by time period.
Stage 3: Intelligence (ML Models)
This is where the magic happens. We run three machine learning models on every review to extract signals about quality, authenticity, and sentiment.
Model 1: Review Quality Scoring
Not all reviews are equally useful. A thoughtful, detailed review with specific product feedback is more valuable than "Great product! Love it!"
Our quality model scores each review from 0 to 1 based on:
- Substantiveness: Does it contain specific details about the product?
- Helpfulness: Does it address common questions buyers might have?
- Coherence: Is it well-written and easy to understand?
- Specificity: Does it mention specific use cases, comparisons, or results?
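The real quality model is learned from data, but a toy heuristic shows the kinds of signals involved. The term list, weights, and thresholds below are invented for illustration and are not the production model:

```python
# Invented vocabulary of product-specific terms (illustrative only)
SPECIFIC_TERMS = {"coverage", "shade", "texture", "lasted", "oily", "dry", "blend", "blended"}

def quality_score(text):
    """Toy stand-in for the learned quality model: rewards enough detail,
    product-specific vocabulary, and readable sentence lengths."""
    words = text.split()
    if not words:
        return 0.0
    substantive = min(len(words) / 100, 1.0)              # substantiveness
    hits = sum(w.strip("!.,").lower() in SPECIFIC_TERMS for w in words)
    specificity = min(10 * hits / len(words), 1.0)        # specificity
    sentences = max(text.count(".") + text.count("!"), 1)
    coherence = 1.0 if 5 <= len(words) / sentences <= 30 else 0.5
    return round(0.40 * substantive + 0.35 * specificity + 0.25 * coherence, 3)
```

Even this crude version separates "Great product! Love it!" from a review that mentions coverage, shade match, and wear time.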
Model 2: Fake/Incentivized Detection
Even with disclosure labels, some incentivized reviews slip through. Our detection model identifies reviews that behave like incentivized reviews, even if they're not explicitly labeled.
The model is trained on labeled incentivized reviews and learns to recognize patterns like:
- Suspiciously positive sentiment for new accounts
- Templated language patterns
- Timing clusters (many reviews posted in a short window)
- Semantic similarity to known incentivized reviews
Model 3: Sentiment Analysis
Star ratings don't tell the whole story. Someone might give 4 stars but write "This is my holy grail product! I'll never use anything else!" That's 5-star sentiment with a 4-star rating.
Our sentiment model extracts the true emotional signal from the review text, allowing us to:
- Identify "strict raters" who write glowing reviews but give conservative stars
- Catch mismatches between text sentiment and rating
- Measure enthusiasm beyond the 1-5 scale
Stage 4: Ranking (Love Score)
The final stage combines all of our intelligence into a single metric: the Love Score.
The Love Score answers a simple question: "Is this product genuinely loved by customers who paid for it?"
It's not a simple average. It's a weighted combination of multiple signals, adjusted for confidence and corrected for known biases. We'll break down exactly how it works in the next section.
What 4.4 Million Ratings Look Like
Before diving into our scoring algorithm, let's look at the raw rating distribution across all 4.4 million reviews.
64.3% of all reviews are 5 stars. At first glance, this looks great — customers love Sephora products!
But we know from our earlier analysis that incentivized reviews inflate ratings by 0.80 stars on average. When we look at only organic reviews, the distribution shifts:
Organic 5-star rate: ~58% (down from 64.3%)
Organic average rating: 3.84 (down from 4.21 overall)
This 6 percentage point difference in 5-star reviews doesn't sound like much, but it compounds across millions of reviews. Products that appear to have overwhelming positive sentiment often have a more nuanced reality.
The "J-Curve" Problem
Notice the shape of the distribution: lots of 5-stars, then each lower rating has fewer reviews, except for a bump at 1-star. This is called a "J-curve" and it's common in review systems.
Why does this happen? People are most motivated to review when they have strong feelings — either very positive or very negative. The people in the middle (3-4 stars) are less likely to bother writing a review.
This means review distributions are inherently bimodal. A product with 90% 5-star reviews might actually have lots of satisfied but silent 4-star customers. Our algorithm accounts for this by looking at the full distribution, not just the average.
The Love Score
The Love Score is our proprietary 0-1 metric for measuring genuine customer love. It's designed to answer the question: "If I buy this product, will I love it?"
The score is calculated in three steps:
- Component Calculation: Five weighted signals are combined
- Confidence Multiplier: The raw score is scaled by our confidence level
- Adjustments: Penalties and boosts are applied for specific patterns
Let's break down each step.
Step 1: The Five Components
The Love Score combines five distinct signals, each measuring a different aspect of product quality:
Component 1: Organic Quality (35%)
This is the most important component. We look at ratings and recommendations only from organic reviewers — people who bought the product with their own money.
The Organic Quality score is calculated as:
```
organic_quality = (
    0.60 × normalized_organic_rating +
    0.40 × organic_recommendation_rate
)

where:
    normalized_organic_rating   = (avg_organic_rating - 1) / 4
    organic_recommendation_rate = organic_recommends / organic_total
```

By focusing exclusively on organic reviews, we eliminate the inflation bias from incentivized reviewers. A product with a 4.5-star organic rating is genuinely better than one with a 4.5-star overall rating (which might be inflated by free samples).
Component 2: Engagement (25%)
Engagement measures how much the community values the reviews for this product. High engagement means reviews are substantive and helpful.
```
engagement = (
    0.50 × helpfulness_ratio +
    0.30 × photo_rate +
    0.20 × substantive_review_rate
)

where:
    helpfulness_ratio       = helpful_votes / (helpful + unhelpful + 1)
    photo_rate              = reviews_with_photos / total_reviews
    substantive_review_rate = reviews_over_50_words / total_reviews
```

Products with high engagement have reviews that other shoppers find useful. This is a strong signal of genuine customer interest.
Component 3: Authenticity (15%)
Authenticity measures the ratio of organic to total reviews. Products that rely heavily on incentivized reviews to boost their ratings are penalized.
```
authenticity = organic_reviews / total_reviews

# If organic status can't be determined, we fall back to a conservative
# default of 0.13 -- well below the population-wide organic share of
# 67.9% (100% - 32.1%)
```

A product where 90% of reviews are organic scores higher than one where only 50% are organic, all else being equal.
Component 4: Diversity (15%)
Does this product work for everyone, or just a narrow demographic? Diversity measures how well-loved a product is across different skin types and skin tones.
```
diversity = (
    0.50 × skin_type_diversity +
    0.50 × skin_tone_diversity
)

# Diversity is measured as the inverse of variance in ratings
# across demographic groups. High diversity = consistent ratings
# across all skin types/tones.
```

A foundation that works beautifully for fair skin but poorly for deep skin tones will score lower on diversity than one that works well across the spectrum.
Component 5: Trend (10%)
Is this product currently popular, or was it a hit years ago that's since fallen out of favor? Trend measures recent activity.
```
trend = reviews_last_180_days / total_reviews

# Products with recent reviews are weighted slightly higher.
# This helps surface new products that are gaining traction.
```

Trend is the lowest-weighted component because a great product is great regardless of when it became popular. But all else being equal, we prefer products that are actively being reviewed.
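With all five components defined, the combination step is a weighted sum. A minimal sketch, where the dict-of-components interface is an assumption of this illustration:

```python
# Component weights from the Love Score definition
WEIGHTS = {
    "organic_quality": 0.35,
    "engagement": 0.25,
    "authenticity": 0.15,
    "diversity": 0.15,
    "trend": 0.10,
}

def raw_love_score(components):
    """Weighted combination of the five component scores, each in [0, 1]."""
    assert set(components) == set(WEIGHTS)
    return round(sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 3)
```

For example, component scores of 0.82, 0.71, 0.75, 0.68, and 0.45 combine to a raw score of about 0.724 (per-term rounding can shift the third decimal).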
Step 2: The Confidence Multiplier
Raw scores don't account for sample size. A product with 3 perfect reviews shouldn't rank above a product with 1,000 good reviews.
The confidence multiplier scales scores based on how many organic reviews we have:
```
confidence  = log(organic_reviews + 1) / log(150)
confidence  = min(confidence, 1.0)   # Cap at 100%

final_score = raw_score × confidence
```

This logarithmic formula has some important properties:
Key observations:
- Early reviews matter most: Going from 0→10 reviews already yields ~48% confidence. Each review has a big impact.
- Diminishing returns: Going from 100→110 reviews adds only ~2% confidence. You're already confident.
- Saturation at 150: Beyond 150 organic reviews, confidence is 100%. More reviews don't help.
Why logarithmic? This mirrors how humans think about evidence. The difference between 0 and 10 reviews feels bigger than the difference between 100 and 110 reviews, even though both are +10.
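In code, the multiplier described above, with the 150-review saturation point exposed as a parameter:

```python
import math

def confidence(organic_reviews, saturation=150):
    """Confidence multiplier: logarithmic growth, capped at 1.0."""
    return min(math.log(organic_reviews + 1) / math.log(saturation), 1.0)
```

For example, confidence(10) evaluates to about 0.48, while confidence(180) exceeds 1 and is capped at 1.0.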
Step 3: Adjustments
The final step applies penalties and boosts for specific patterns that our component scores might miss.
Penalties (Reduce Score)
| Penalty | Max Impact | Trigger Condition |
|---|---|---|
| Inflation | -15% | Paid vs organic rating gap > 0.8 stars |
| Staff | -14% | >30% reviews from Sephora employees |
| Polarization | -10% | High variance + >15% negative reviews |
| ML Quality | -12% | Average quality score < 0.5 |
Inflation Penalty (-15% max)
If there's a large gap between paid and organic ratings (more than 0.8 stars), we penalize the product. This catches products where incentivized reviewers are artificially inflating scores.
Staff Penalty (-14% max)
If more than 30% of reviews come from Sephora employees, we apply a penalty. Employees may have biases (brand loyalty, product access, job security concerns).
Polarization Penalty (-10%)
Some products have passionate lovers AND passionate haters. If rating variance is high AND more than 15% of reviews are 1-2 stars, we penalize. "Love it or hate it" products aren't universally good.
ML Quality Penalty (-12% max)
If the average review quality score (from our ML model) is below 0.5, we penalize. Products that attract low-effort, generic reviews don't provide useful signal.
Boosts (Increase Score)
| Boost | Max Impact | Trigger Condition |
|---|---|---|
| Power User | +8% | Loved by reviewers with 21+ organic reviews |
| Rating Trend | +10% | Recent ratings higher than historical |
Power User Boost (+8% max)
Power users are reviewers with 21+ organic reviews. They've tried everything and are hard to impress. If a significant portion of a product's reviewers are power users AND they love it, we boost the score.
Rating Trend Adjustment (±10%)
If recent ratings (last 90 days) are significantly higher than historical ratings, we boost. If they're declining, we penalize. This catches formulation changes, packaging issues, or improvements.
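A sketch of the adjustment step. The trigger thresholds come from the tables above, but how each penalty scales within its band, and the treatment of adjustments as absolute deltas rather than percentages, are assumptions of this illustration:

```python
def adjusted_score(weighted, *, inflation_gap=0.0, staff_share=0.0,
                   rating_std=0.0, low_star_share=0.0, avg_quality=1.0,
                   power_user_love=0.0, recent_trend=0.0):
    """Apply the penalty/boost schedule to a confidence-weighted score."""
    score = weighted
    if inflation_gap > 0.8:                          # paid vs organic gap
        score -= min(0.15, 0.15 * (inflation_gap - 0.8))
    if staff_share > 0.30:                           # too many employee reviews
        score -= min(0.14, 0.14 * (staff_share - 0.30) / 0.70)
    if rating_std > 1.5 and low_star_share > 0.15:   # polarization
        score -= 0.10
    if avg_quality < 0.5:                            # low-effort reviews
        score -= min(0.12, 0.12 * (0.5 - avg_quality) / 0.5)
    score += min(0.08, max(0.0, power_user_love))    # power-user boost, capped
    score += max(-0.10, min(0.10, recent_trend))     # rating-trend adjustment
    return max(0.0, min(1.0, round(score, 3)))
```

Each penalty and boost is capped at its maximum impact, and the final score is clamped back into [0, 1].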
Putting It All Together
Here's an example of how a final Love Score is calculated:
```
Example Product: "Glow Serum Pro"

STEP 1: Component Scores
  Organic Quality:  0.82 × 0.35 = 0.287
  Engagement:       0.71 × 0.25 = 0.178
  Authenticity:     0.75 × 0.15 = 0.113
  Diversity:        0.68 × 0.15 = 0.102
  Trend:            0.45 × 0.10 = 0.045
  ─────────────────────────────────────
  Raw Score:                      0.725

STEP 2: Confidence Multiplier
  Organic Reviews: 180
  Confidence: log(181) / log(150) ≈ 1.04 → capped at 1.0
  Weighted Score: 0.725 × 1.0 = 0.725

STEP 3: Adjustments
  Power User Boost:   +0.04 (8% of reviewers are power users)
  Inflation Penalty:  -0.02 (small organic/paid gap)
  ─────────────────────────────────────
  Final Love Score: 0.745
```

The 68+ ML Features
Our machine learning models analyze over 68 features for each review. These features are organized into 9 categories, each designed to catch different types of patterns.
Feature Category Deep Dive
Advanced Linguistic Features (19)
These features analyze the writing style of reviews. We measure vocabulary diversity (type-token ratio), use of rare words (hapax legomena), readability (Flesch-Kincaid), and patterns like excessive first-person pronouns, hedge words ("maybe", "perhaps"), and intensifiers ("very", "really").
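Several of these style features are cheap to compute directly from the text. A sketch with abbreviated word lists (the full lists and the exact tokenization rules are assumptions):

```python
import re
from collections import Counter

HEDGES = {"maybe", "perhaps", "possibly", "somewhat"}
INTENSIFIERS = {"very", "really", "so", "extremely"}

def linguistic_features(text):
    """A few of the style features described above, computed directly."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n = len(words) or 1
    return {
        "type_token_ratio": len(counts) / n,                      # vocabulary diversity
        "hapax_rate": sum(c == 1 for c in counts.values()) / n,   # words used exactly once
        "first_person_rate": sum(w in {"i", "me", "my"} for w in words) / n,
        "hedge_rate": sum(w in HEDGES for w in words) / n,
        "intensifier_rate": sum(w in INTENSIFIERS for w in words) / n,
    }
```

Readability scores like Flesch-Kincaid additionally require syllable counts, which we omit here for brevity.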
Temporal Patterns (10)
When a review is posted matters. We track the time of day, day of week, account age at time of review, and gaps between reviews. Coordinated campaigns often create "spikes" — many reviews posted in a short window.
User Behavior (6)
How does this reviewer behave across all their reviews? Do they always give 5 stars? Is this their only review ever? How does this rating compare to their personal average? One-time reviewers and "always 5-star" reviewers are suspicious.
Embedding Features (7 + 50 PCA components)
We use BERT embeddings to capture semantic meaning. The 768-dimensional BERT vectors are reduced to 50 principal components. We also measure semantic similarity between reviews (copy-paste detection) and identify spike patterns.
A sophisticated fake review might pass text analysis (well-written, specific details) but fail temporal analysis (posted during a campaign spike) or user analysis (reviewer only reviews this brand). Using 9 categories makes our detection robust.
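The copy-paste detection mentioned above reduces to pairwise cosine similarity over embedding vectors, flagged above a threshold. A dependency-free sketch, with small placeholder vectors standing in for the PCA-reduced BERT embeddings (the 0.95 threshold is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs whose embedding similarity suggests copy-paste."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The O(n²) pairwise loop is fine for one product's reviews; at corpus scale an approximate-nearest-neighbor index would replace it.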
Model Architecture
We use an ensemble of multiple models, combining traditional machine learning with deep learning for the best results.
Traditional ML Models
| Model | Configuration | Strengths |
|---|---|---|
| Logistic Regression | L2 regularization | Fast, interpretable coefficients |
| Gradient Boosting | 100 trees, depth=5 | Handles non-linear relationships |
| Random Forest | 100 trees, depth=10 | Robust to outliers |
Deep Learning: DistilBERT
For capturing nuanced semantic patterns, we use DistilBERT — a smaller, faster version of BERT that retains 97% of its language understanding capability.
DistilBERT Configuration:

```
Max sequence length: 256 tokens
Training epochs:     3
Batch size:          16
Learning rate:       2e-5
Mixed precision:     Enabled (AMP)
GPU:                 Required for training
```

Ensemble Strategy
Our final predictions combine traditional ML and deep learning:
```
final_prediction = (
    0.50 × traditional_ml_prediction +
    0.50 × bert_prediction
)

# Weights determined via grid search (0.0-1.0, in 0.1 steps)
# The 50/50 split performed best on the validation set
```

Why ensemble? Traditional ML models are good at capturing explicit patterns in our engineered features. BERT is good at capturing implicit patterns in the text itself. Together, they catch things neither would catch alone.
Data Schema Details
For those interested in the technical details, here's the complete schema of our 6-table database:
review_id (PK) • product_id • author_id • rating • review_text • title • is_recommended • submitted_at
skin_type • skin_tone • eye_color • hair_color • age_range • is_incentivized • is_staff
helpful_votes • unhelpful_votes • helpfulness_score
photo_id • review_id • photo_url • caption
author_id (PK) • total_reviews • avg_rating • organic_reviews • first_review_date
created_at • updated_at • source_client • campaign_id
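A minimal sqlite rendition of part of this schema, covering three of the six tables with abbreviated columns; the SQL types and constraints are assumptions, since the source lists only field names:

```python
import sqlite3

DDL = """
CREATE TABLE users (
    author_id         TEXT PRIMARY KEY,
    total_reviews     INTEGER,
    avg_rating        REAL,
    organic_reviews   INTEGER,
    first_review_date TEXT
);
CREATE TABLE reviews (
    review_id      TEXT PRIMARY KEY,
    product_id     TEXT,
    author_id      TEXT REFERENCES users(author_id),
    rating         INTEGER CHECK (rating BETWEEN 1 AND 5),
    review_text    TEXT,
    title          TEXT,
    is_recommended INTEGER,
    submitted_at   TEXT
);
CREATE TABLE photos (
    photo_id  TEXT PRIMARY KEY,
    review_id TEXT REFERENCES reviews(review_id),
    photo_url TEXT,
    caption   TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The fact tables (reviews, photos) reference the user dimension by key, which is what makes the "join only what you need" queries cheap.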
Technical Challenges
Building this pipeline wasn't straightforward. Here are the hardest problems we solved:
Challenge 1: Rate Limiting at Scale
Scraping 5.5 million reviews at 1 request per 2 seconds would take 127 days of continuous operation. We needed to parallelize while respecting rate limits.
Solution: We implemented a distributed scraping system with 8 workers, each targeting different product categories. Combined with intelligent backoff and checkpointing, we completed the scrape in under a week.
Challenge 2: Polarized Products
Some products look great on average but have huge variance:
```
Example: "Controversial Foundation X"

5-star reviews: 60%
4-star reviews: 10%
3-star reviews:  5%
2-star reviews:  5%
1-star reviews: 20%  ← Warning sign!

Average: 3.85 stars (looks respectable!)
Reality: "Love it or hate it" product
```

Solution: Our polarization penalty kicks in when the rating standard deviation is greater than 1.5 AND more than 15% of reviews are 1-2 stars. These products get demoted.
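The polarization check is easy to state precisely. A sketch operating on a star→share distribution:

```python
import math

def is_polarized(dist):
    """dist maps star (1-5) -> share of reviews. Flags 'love it or hate it'
    products: rating std dev > 1.5 AND more than 15% of reviews at 1-2 stars."""
    mean = sum(star * share for star, share in dist.items())
    variance = sum(share * (star - mean) ** 2 for star, share in dist.items())
    low_star_share = dist.get(1, 0) + dist.get(2, 0)
    return math.sqrt(variance) > 1.5 and low_star_share > 0.15

# The "Controversial Foundation X" distribution from the example
foundation_x = {5: 0.60, 4: 0.10, 3: 0.05, 2: 0.05, 1: 0.20}
```

For Foundation X, the standard deviation works out to about 1.62 and the 1-2-star share to 25%, so both triggers fire.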
Challenge 3: The Weak Labeling Problem
We can't manually label 4.4 million reviews as "authentic" or "suspicious." We needed a way to train our ML models without ground truth labels.
Solution: We used the is_incentivized flag as a proxy label. Incentivized reviews aren't necessarily "fake," but they're systematically different. By training our model to recognize incentivized patterns, we can also catch undisclosed promotional reviews.
Challenge 4: Sentiment-Rating Mismatch
"4 stars: This is my absolute holy grail product! I've repurchased 5 times and will never use anything else!"
This reviewer gave 4 stars but wrote a 5-star review. If we only look at the star rating, we're missing the true signal.
Solution: Our sentiment model extracts the emotional content of the text. When sentiment significantly exceeds the star rating, we boost the effective rating. "Strict raters" don't penalize products.
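A toy version of this mismatch check, with a tiny invented lexicon standing in for the real sentiment model; the 0.2 sentiment threshold and 0.5-star boost are assumptions for illustration:

```python
# Tiny invented lexicon -- a stand-in for the learned sentiment model
POSITIVE = {"holy", "grail", "love", "amazing", "perfect", "repurchased"}

def effective_rating(stars, text, boost=0.5):
    """Nudge the effective rating upward when text sentiment clearly
    exceeds the star rating, so 'strict raters' don't penalize products."""
    words = [w.strip("!.,").lower() for w in text.split()]
    sentiment = sum(w in POSITIVE for w in words) / max(len(words), 1)
    if sentiment > 0.2 and stars < 5:
        return stars + boost
    return stars
```

A 4-star "holy grail" review gets nudged toward its true enthusiasm, while a lukewarm 4-star review is left unchanged.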
Challenge 5: Power User Identification
Some reviewers are more valuable than others. A beauty blogger with 500 reviews has more expertise than a first-time reviewer.
Solution: We define "power users" as reviewers with 21+ organic reviews. Their opinions are weighted more heavily, and products they love get a boost.
Results & Insights
After building this entire pipeline, what did we learn?
Key Findings
- Star ratings are inflated by incentivized reviews: the overall average of 4.21 drops to 3.84 once only organic reviews are counted. The "true" organic rating is almost always lower than the displayed average.
- The community knows who to trust. Organic reviews get 8.3× more helpful votes than incentivized reviews. This crowd wisdom is a powerful signal.
- Review length is a suspicion signal. Longer reviews are more likely to be incentivized (obligation-driven), not necessarily better.
- Power users are harder to impress. Products loved by veteran reviewers tend to be genuinely excellent.
- Diversity matters. Products that work across skin types and tones score higher and are more universally loved.
Ranking Improvements
When we compare rankings by raw star average vs. our Love Score, many products move significantly:
- Products that moved UP: High organic quality, strong power user endorsement, good diversity scores
- Products that moved DOWN: Heavy incentivized review programs, high staff review percentage, polarized ratings
The Love Score surfaces products that are genuinely loved by paying customers. It filters out the marketing noise and finds the gems.
Conclusion
We set out to answer a simple question: "Which beauty products are genuinely loved?"
The answer required building an entire machine learning pipeline: scraping millions of reviews, cleaning and normalizing the data, training models to detect quality and authenticity, and developing a scoring algorithm that surfaces genuine customer love.
Along the way, we discovered that star ratings are systematically biased by incentivized reviews, that the community's "helpful" votes are a powerful signal of authenticity, and that the gap between marketing and reality can be measured with data.
Products ranked by genuine customer love, not marketing spend.
The beauty industry is built on trust. Our system helps restore that trust by surfacing what real customers actually think — not what brands want you to believe.