System Design: A/B Testing Platform — Experiment Assignment, Metric Collection, and Statistical Analysis

Requirements

An A/B testing platform enables product teams to run controlled experiments: show variant A to 50% of users and variant B to the other 50%, collect outcome metrics, and determine which variant wins with statistical confidence. Core requirements: consistent assignment (a user always sees the same variant), bucketing at scale (millions of assignments per second), metric collection (clicks, conversions, revenue), statistical analysis (p-values, confidence intervals), and experiment management (define, launch, stop, archive). A/B testing is how companies like Netflix, Airbnb, and LinkedIn make data-driven product decisions. Netflix runs 250+ concurrent experiments. LinkedIn runs thousands per year. This is a common system design question at data-driven product companies.

Experiment Assignment

Assignment must be: consistent (the same user always gets the same variant), random (no systematic bias), and fast (< 1 ms, since it is called on every page load). Hashing-based assignment: hash(user_id + experiment_id) mod 100 gives a stable bucket number 0-99; assign users with buckets 0-49 to control and 50-99 to treatment. The hash is deterministic, so the same inputs always produce the same output. No state required, no database lookup needed. Algorithm: SHA-256, or MurmurHash3 (faster, non-cryptographic).

Why add experiment_id to the hash: without it, users in the same 50% bucket would land in the same bucket for every experiment, introducing correlation across experiments. Combining user_id and experiment_id ensures independent assignments across experiments. Traffic allocation: flexible, e.g. 50/50, 80/20 (ramp test), or 33/33/33 (three-way test). Bucket ranges are computed from the allocation percentages.

Holdout groups: reserve a permanent holdout (e.g., 5% of users who never see any experiment) to measure the cumulative effect of all experiments over time. Mutual exclusion: experiments on the same surface should be mutually exclusive (a user can be in only one of them at a time). Layer-based architecture: group mutually exclusive experiments into the same layer; each layer has non-overlapping buckets.
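A minimal Python sketch of the hash-based bucketing described above, assuming SHA-256 as the hash; the names (Experiment, assign_variant) and the allocation format are illustrative, not any specific platform's API:

```python
# Hypothetical sketch: deterministic hash-based bucketing (SHA-256), with
# bucket ranges derived from the experiment's allocation percentages.
import hashlib
from dataclasses import dataclass

@dataclass
class Experiment:
    experiment_id: str
    allocations: list[tuple[str, int]]  # ordered (variant, percent); must sum to 100

def bucket(user_id: str, experiment_id: str) -> int:
    """Map (user_id, experiment_id) to a stable bucket in 0-99. No state, no lookup."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign_variant(user_id: str, exp: Experiment) -> str:
    """Walk cumulative bucket ranges: e.g. 50/50 -> control gets 0-49, treatment 50-99."""
    b = bucket(user_id, exp.experiment_id)
    upper = 0
    for variant, percent in exp.allocations:
        upper += percent
        if b < upper:
            return variant
    raise ValueError("allocation percentages must sum to 100")

# Same inputs always produce the same variant, and including experiment_id
# keeps assignments independent across experiments.
exp = Experiment("new_checkout_flow", [("control", 50), ("treatment", 50)])
print(assign_variant("user_12345", exp))
```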

Event Collection and Metric Pipeline

Every user action that might become a metric is logged as an event: page view, click, add to cart, purchase, session duration, error. Event schema: {user_id, experiment_id, variant, event_type, timestamp, properties (JSON)}. Collection path: client SDK → event ingestion API → Kafka → stream processor → metrics store.

SDK: a JavaScript or mobile SDK intercepts clicks and logs events, including checkout completion for revenue metrics. It batches events (sending every 5 seconds or on page unload) to reduce HTTP requests. Assignment event: logged when a user is first assigned to an experiment; all subsequent events join on user_id + experiment_id.

Metric aggregation: for each experiment + variant, compute the count of users exposed, the count of conversions, and the sum of revenue. Pre-aggregate per hour in the stream processor (Flink). Store in ClickHouse: (experiment_id, variant, metric_name, bucket_hour, value, user_count). Dashboard queries aggregate over the desired time window.
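An illustrative sketch of the event schema and the hourly rollup keyed by (experiment_id, variant, metric_name, bucket_hour); a production pipeline would keep this state in Flink and write rows to ClickHouse, but plain Python shows the shape of the pre-aggregation. Field and function names here are assumptions:

```python
# Illustrative event schema and hourly rollup; a real pipeline keeps this state
# in the stream processor (Flink) and writes rows to ClickHouse.
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    user_id: str
    experiment_id: str
    variant: str
    event_type: str               # "exposure", "click", "purchase", ...
    timestamp: float              # epoch seconds
    properties: dict = field(default_factory=dict)   # e.g. {"revenue": 19.99}

def bucket_hour(ts: float) -> str:
    """Truncate a timestamp to the hour it falls in (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:00")

# key: (experiment_id, variant, metric_name, bucket_hour) -> running value + distinct users
rollup = defaultdict(lambda: {"value": 0.0, "users": set()})

def process(event: Event) -> None:
    """Fold one event into the hourly pre-aggregate the dashboard later queries."""
    key = (event.experiment_id, event.variant, event.event_type, bucket_hour(event.timestamp))
    entry = rollup[key]
    entry["value"] += float(event.properties.get("revenue", 1.0))  # sum revenue or count events
    entry["users"].add(event.user_id)

process(Event("user_12345", "new_checkout_flow", "treatment", "purchase",
              1_700_000_000.0, {"revenue": 19.99}))
```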

Statistical Analysis

Goal: determine whether the observed difference between variants is statistically significant or could be due to chance. Two-sample t-test for continuous metrics (e.g., revenue per user): compute the t-statistic t = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B), convert it to a p-value, and if p < 0.05 reject the null hypothesis (the difference is significant at 95% confidence). Chi-squared test for binary metrics (conversion rate): compare observed vs. expected conversion counts.

Multiple testing problem: running 10 metrics per experiment and declaring significance at p < 0.05 gives roughly a 40% chance of at least one false positive (1 - 0.95^10 ≈ 0.40). Fix: Bonferroni correction (divide alpha by the number of metrics: 0.05/10 = 0.005 threshold per metric) or control the False Discovery Rate (Benjamini-Hochberg).

Sequential testing: in traditional A/B testing, you must pre-commit to a sample size and not peek at results early, because peeking inflates the false positive rate. Sequential methods (e.g., always-valid p-values, mSPRT) allow continuous monitoring without inflating error rates.

Power analysis: before launching an experiment, compute the sample size required to detect a minimum detectable effect (MDE) with the desired power (80%) and significance level (5%). This ensures the experiment collects enough data before any conclusion is drawn. Minimum runtime: typically 1-2 full business cycles (1-2 weeks) to account for day-of-week effects.
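A rough sketch of the Welch two-sample t-test on summary statistics plus the power-analysis sample-size formula. The p-value uses a large-sample normal approximation; on raw per-user data, scipy.stats.ttest_ind(a, b, equal_var=False) is the usual library route. All numbers and function names below are illustrative:

```python
# Welch's t-test on pre-aggregated summary statistics, plus required sample size.
# p-value uses a normal approximation, which is reasonable at A/B-test sample sizes.
import math

def welch_t_test(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Return (t_statistic, two_sided_p_value) for the difference in means."""
    t = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    p = math.erfc(abs(t) / math.sqrt(2))   # P(|Z| >= |t|) under the normal approximation
    return t, p

def required_sample_size(mde, sigma, z_alpha=1.96, z_beta=0.84):
    """Per-variant n to detect an absolute effect `mde` on a metric with std dev `sigma`
    at 5% significance (two-sided) and 80% power (z values are standard normal quantiles)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

t, p = welch_t_test(mean_a=5.20, var_a=4.0, n_a=120_000,
                    mean_b=5.05, var_b=4.1, n_b=120_000)
print(t, p, p < 0.05 / 10)                          # Bonferroni threshold for 10 metrics
print(required_sample_size(mde=0.10, sigma=2.0))    # ~6,272 users per variant
```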

Experiment Management and Guardrails

Experiment lifecycle: DRAFT → REVIEW → RUNNING → STOPPED → ARCHIVED. Review: product, engineering, and data science sign off before launch.

Guardrail metrics: in addition to the primary metric (e.g., conversion rate), monitor guardrail metrics that should not regress: latency, error rate, revenue per user. If a guardrail regresses significantly, automatically stop the experiment and alert the team. This prevents a “winning” experiment from harming the business in unmeasured ways. Ramp-up: start with 1% of traffic, verify there are no errors or guardrail regressions, then ramp to 10%, 50%, and 100% over hours or days. This reduces the blast radius of bugs.

Interaction effects: two experiments running on the same page may interact. The layer system prevents interactions within a layer, but experiments in different layers can still interact. CUPED (Controlled-experiment Using Pre-Experiment Data): use each user's pre-experiment behavior as a covariate to reduce variance, so smaller effects can be detected with the same sample size. This is standard practice at mature experimentation platforms.
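A minimal CUPED sketch, assuming each user's pre-experiment value of the metric is available as a covariate; theta = cov(X, Y) / var(X) is the standard choice, and the data and function name are illustrative:

```python
# CUPED: adjust the in-experiment metric Y with a pre-experiment covariate X
# using theta = cov(X, Y) / var(X). The adjusted metric keeps the same mean but
# has lower variance, so the same t-test detects smaller effects.
import statistics

def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """Return Y_cuped = Y - theta * (X - mean(X))."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / len(x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Pre-experiment revenue predicts in-experiment revenue, so adjusting for it
# removes predictable between-user variance before the significance test runs.
pre  = [10.0, 0.0, 25.0, 5.0, 40.0, 2.0]   # per-user revenue before assignment
post = [12.0, 1.0, 24.0, 6.0, 43.0, 0.0]   # per-user revenue during the experiment
adjusted = cuped_adjust(post, pre)
print(statistics.pvariance(post), statistics.pvariance(adjusted))   # variance drops
```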
