Glossary

Plain-English definitions of terms used throughout this documentation.


Review Terms

Organic Review

A review written without any incentive — no free product, no discount, no compensation. These are the most trustworthy reviews because the person had no external motivation to be positive.

Incentivized Review

A review where the reviewer received something in exchange, typically a free product. Sephora marks these explicitly. They tend to be more positive than organic reviews (by about 0.3-0.5 stars on average).

Substantive Review

A review with actual content — more than 50 words. "Great product!" is not substantive. A paragraph explaining what worked and what didn't is substantive.
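
The 50-word rule can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name `is_substantive` is hypothetical, and counting words by splitting on whitespace is an assumption.

```python
def is_substantive(review_text: str, min_words: int = 50) -> bool:
    """Return True when the review has more than `min_words` words.

    Assumption: a "word" is any whitespace-separated token.
    """
    return len(review_text.split()) > min_words
```

Under this sketch, "Great product!" counts two words and is not substantive.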

Helpfulness Ratio

The share of votes on a review that are "helpful" rather than "not helpful," expressed as a fraction from 0 to 1. Formula: helpful_votes / (helpful_votes + unhelpful_votes). The ratio is undefined for a review with no votes. A high ratio means the community validates this review.
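
As a rough sketch of the formula, the function below returns `None` when a review has no votes. That choice is an assumption; the pipeline may handle the zero-vote case differently.

```python
from typing import Optional


def helpfulness_ratio(helpful_votes: int, unhelpful_votes: int) -> Optional[float]:
    """helpful_votes / (helpful_votes + unhelpful_votes), or None with no votes."""
    total_votes = helpful_votes + unhelpful_votes
    if total_votes == 0:
        return None  # assumption: treat "no votes" as "no ratio" rather than 0
    return helpful_votes / total_votes
```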


User Terms

Power User

A reviewer who has written 21+ organic reviews. These people are experienced reviewers, harder to impress, and more likely to give balanced feedback. Their opinions carry extra weight.

One-Time Reviewer

Someone who has only written one review ever. Common but less trustworthy — we don't know if they're harsh or generous raters. Could also be a throwaway account for fake reviewing.
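
The two reviewer categories above could be assigned roughly as follows. The function name, the label strings, and the catch-all "other" label are hypothetical; only the thresholds (21+ organic reviews, exactly one review ever) come from the definitions.

```python
def reviewer_tier(total_reviews: int, organic_reviews: int) -> str:
    """Classify a reviewer using the glossary thresholds."""
    if organic_reviews >= 21:
        return "power_user"
    if total_reviews == 1:
        return "one_time"
    return "other"  # hypothetical label for everyone in between
```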

Staff Member

A Sephora employee. Their reviews are marked in the data. While they might have product knowledge, they also have incentive to promote products.


Score Terms

Love Score

Our custom metric (0-1) that identifies products people genuinely love. It combines organic quality, engagement, authenticity, diversity, and trend, and applies a confidence multiplier and adjustments.

Confidence Multiplier

A factor that scales scores based on how much data we have. A product with 3 reviews might have great metrics, but we can't trust it. Confidence saturates at ~150 organic reviews.
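
The exact shape of the curve is not given here, so the sketch below assumes the simplest one that saturates at 150 organic reviews: a linear ramp capped at 1.0. Both the function name and the linear form are assumptions for illustration.

```python
def confidence_multiplier(organic_review_count: int, saturation: int = 150) -> float:
    """Linear ramp from 0 to 1 that stops growing at `saturation` reviews.

    Assumption: the real multiplier may use a different (e.g. smoother) curve.
    """
    clamped_count = max(0, organic_review_count)
    return min(1.0, clamped_count / saturation)
```

With this shape, a product with 3 organic reviews keeps only a small fraction of its raw score, while products at or past 150 reviews are not scaled down at all.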

Rating Inflation

When paid/incentivized reviewers give higher ratings than organic reviewers for the same product. A gap of 0.8+ stars triggers a penalty.
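
A minimal sketch of the check, assuming the gap is the difference between mean ratings of the two groups for one product. The function names are hypothetical, and returning no penalty when either group is empty is an assumption.

```python
from statistics import mean
from typing import Sequence


def rating_inflation_gap(incentivized: Sequence[float], organic: Sequence[float]) -> float:
    """Mean incentivized rating minus mean organic rating for one product."""
    return mean(incentivized) - mean(organic)


def triggers_inflation_penalty(
    incentivized: Sequence[float], organic: Sequence[float], threshold: float = 0.8
) -> bool:
    """True when incentivized ratings exceed organic ones by `threshold` stars or more."""
    if not incentivized or not organic:
        return False  # assumption: no comparison is possible, so no penalty
    return rating_inflation_gap(incentivized, organic) >= threshold
```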

Polarization

When a product has both lovers and haters — high rating variance (std > 1.5) combined with many negative reviews (>15%). "Love it or hate it" products get penalized.
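
The two conditions can be combined as in the sketch below. Two details are assumptions, since the definition does not specify them: a "negative" review is taken to be 1 or 2 stars, and the standard deviation is the population form (`statistics.pstdev`).

```python
from statistics import pstdev
from typing import Sequence


def is_polarized(
    ratings: Sequence[int],
    std_threshold: float = 1.5,
    negative_share_threshold: float = 0.15,
    negative_max_stars: int = 2,  # assumption: 1-2 stars counts as negative
) -> bool:
    """True when rating spread and the share of negative reviews are both high."""
    if not ratings:
        return False
    spread = pstdev(ratings)
    negative_count = sum(1 for rating in ratings if rating <= negative_max_stars)
    negative_share = negative_count / len(ratings)
    return spread > std_threshold and negative_share > negative_share_threshold
```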


Data Terms

Parquet

A file format for storing tabular data that's compressed and fast to query. Think of it as a super-efficient spreadsheet format. Reads only the columns you need, not the whole file.

DuckDB

An in-process analytical database that can query Parquet files directly, without importing them first or loading the whole file into memory. Like having SQL power without setting up a database server. Very fast for analytical queries.

JSONL

"JSON Lines" — a text file where each line is a separate JSON object. Easy to process line-by-line, append to, and stream. Used for intermediate data storage.
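
A short sketch of appending to and streaming from JSONL using only the standard library. The helper names and the record fields (`review_id`, `rating`) are hypothetical; an in-memory buffer stands in for a real file.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def append_jsonl(stream: TextIO, record: Dict) -> None:
    """Write one record as a single JSON line."""
    stream.write(json.dumps(record) + "\n")


def read_jsonl(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed object per non-empty line, without reading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


buffer = io.StringIO()  # stands in for open("reviews.jsonl", "a+")
append_jsonl(buffer, {"review_id": 1, "rating": 5})
append_jsonl(buffer, {"review_id": 2, "rating": 2})
buffer.seek(0)
records = list(read_jsonl(buffer))
```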

Star Schema

A database design pattern where one central table (the "fact" table — reviews) connects to multiple satellite tables (the "dimensions" — users, engagement, photos). Reduces redundancy and speeds up queries.
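
To make the pattern concrete, here is a small sketch using Python's built-in `sqlite3` module rather than the project's own storage. All table and column names are hypothetical; only the shape (a central reviews table joined to satellite tables) follows the definition.

```python
import sqlite3

# Hypothetical tables and columns, chosen only to illustrate the pattern.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, skin_type TEXT);
    CREATE TABLE reviews (
        review_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        rating INTEGER
    );
    CREATE TABLE engagement (
        review_id INTEGER PRIMARY KEY REFERENCES reviews(review_id),
        helpful_votes INTEGER
    );
    INSERT INTO users VALUES (10, 'oily'), (11, 'dry');
    INSERT INTO reviews VALUES (1, 10, 5), (2, 11, 2);
    INSERT INTO engagement VALUES (1, 12), (2, 3);
    """
)

# The central reviews table is joined to each satellite table by key.
rows = conn.execute(
    """
    SELECT r.review_id, u.skin_type, r.rating, e.helpful_votes
    FROM reviews AS r
    JOIN users AS u ON u.user_id = r.user_id
    JOIN engagement AS e ON e.review_id = r.review_id
    ORDER BY r.review_id
    """
).fetchall()
```

User attributes such as skin type are stored once per user instead of being repeated on every review row, which is where the reduced redundancy comes from.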


ML Terms

Feature

A measurable property used as input to a machine learning model. Examples: text length, rating, helpful votes, user review count. Our fake detection model uses 68+ features.

Ensemble

Combining multiple models to get better predictions than any single model. We ensemble traditional ML with BERT for fake detection — averaging their probabilities.
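
Averaging probabilities amounts to the sketch below. The function name is hypothetical, and equal weighting of the models is an assumption based on the word "averaging".

```python
from typing import Sequence


def ensemble_probability(model_probabilities: Sequence[float]) -> float:
    """Unweighted mean of each model's predicted probability that a review is fake."""
    if not model_probabilities:
        raise ValueError("at least one model probability is required")
    return sum(model_probabilities) / len(model_probabilities)


# Example: the traditional model says 0.25, the BERT-based model says 0.75.
combined = ensemble_probability([0.25, 0.75])
```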

BERT

A language model from Google that understands text context. "Bank" means something different in "river bank" vs "bank account." We use DistilBERT (a smaller, faster version) for sentiment and fake detection.

Sentiment Score

A measure of emotional tone (0-1). Low = negative ("terrible product"), high = positive ("love this!"), middle = neutral ("it's okay").


Business Terms

BazaarVoice

The third-party service that powers Sephora's review system. They handle review collection, moderation, and provide the API we scrape from. Many major retailers use them.

ETL

Extract, Transform, Load — the process of taking raw data, cleaning it up, and putting it in a usable format. Our cleaning stage is the ETL step.

Scraping

Automatically collecting data from websites or APIs. Not the same as hacking — we're using public APIs and following rate limits.


Metric Terms

Organic Ratio

The fraction of a product's reviews that are organic (not incentivized). Formula: organic_reviews / total_reviews. Multiply by 100 to express it as a percentage. Higher is better for authenticity.
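
A minimal sketch of the formula, assuming each review carries a boolean "incentivized" flag. The function name and the `None` result for a product with no reviews are assumptions.

```python
from typing import Optional, Sequence


def organic_ratio(incentivized_flags: Sequence[bool]) -> Optional[float]:
    """organic_reviews / total_reviews, given one incentivized flag per review."""
    total_reviews = len(incentivized_flags)
    if total_reviews == 0:
        return None  # assumption: no reviews means no ratio
    organic_reviews = sum(1 for flag in incentivized_flags if not flag)
    return organic_reviews / total_reviews
```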

Engagement Quality

A composite score measuring community validation: helpful votes, photo inclusion, and review substantiveness.

Trend Score

How active a product is recently. Formula: reviews_last_180_days / total_reviews. A fading product has low trend; a hot product has high trend.
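
The formula can be sketched as follows. The function name is hypothetical, and treating the 180-day cutoff as inclusive and returning 0.0 for a product with no reviews are both assumptions.

```python
from datetime import date, timedelta
from typing import Sequence


def trend_score(review_dates: Sequence[date], today: date, window_days: int = 180) -> float:
    """reviews_last_180_days / total_reviews for one product."""
    if not review_dates:
        return 0.0  # assumption: no reviews means no recent activity
    cutoff = today - timedelta(days=window_days)
    recent_reviews = sum(1 for d in review_dates if cutoff <= d <= today)
    return recent_reviews / len(review_dates)


example_dates = [date(2024, 6, 1), date(2024, 3, 1), date(2022, 1, 1), date(2021, 5, 5)]
example_score = trend_score(example_dates, today=date(2024, 7, 1))
```

In the example, two of the four reviews fall inside the window, so half of the product's review activity is recent.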

Diversity Score

How well a product works across different skin types and tones. Calculated from the variety of demographics in positive reviews.


That's It!

You've reached the end of the documentation. Return to the Executive Summary or explore any section from the sidebar.