Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing

Zehao Xu; Tony Paek; Kevin O'Sullivan; Attila Dobi

TL;DR

This work presents a surrogate-based framework that enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.

Abstract

Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeating such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable \emph{surrogate-based prevalence measurement} framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using \emph{score bucketing} as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket-level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment--control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.
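The score-bucketing estimate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bucket edges, and the assumption that bucket-level prevalences have already been calibrated offline are all hypothetical. The sketch covers only the log-based step, which weights each bucket's calibrated prevalence by that bucket's share of an arm's impressions.

```python
from bisect import bisect_right


def bucket_index(score, edges):
    """Map a model score to a bucket, given ascending interior edges.

    Buckets are half-open [lower, upper); a score equal to an edge
    falls into the higher bucket.
    """
    return bisect_right(edges, score)


def bucket_shares(impressions, edges):
    """Compute each bucket's share of impressions for one arm.

    `impressions` is an iterable of (score, impression_count) pairs,
    a hypothetical stand-in for rows read from impression logs.
    """
    totals = [0.0] * (len(edges) + 1)
    for score, count in impressions:
        totals[bucket_index(score, edges)] += count
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("arm has no impressions")
    return [t / grand_total for t in totals]


def surrogate_prevalence(shares, bucket_prevalence):
    """Combine impression shares with calibrated bucket-level prevalences."""
    if len(shares) != len(bucket_prevalence):
        raise ValueError("one prevalence value is needed per bucket")
    return sum(s * p for s, p in zip(shares, bucket_prevalence))


# Toy inputs (made up for illustration, not taken from the paper).
edges = [0.5]                    # two buckets: [0, 0.5) and [0.5, 1]
bucket_prevalence = [0.02, 0.60] # assumed offline calibration result
control = [(0.1, 80), (0.9, 20)]
treatment = [(0.1, 90), (0.9, 10)]

p_control = surrogate_prevalence(bucket_shares(control, edges), bucket_prevalence)
p_treatment = surrogate_prevalence(bucket_shares(treatment, edges), bucket_prevalence)
delta = p_treatment - p_control  # treatment-control difference in prevalence
```

In this toy case the treatment arm moves impressions out of the high-score bucket, so the estimated prevalence drops from roughly 0.136 to roughly 0.078 without any new labeling.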

Paper Structure

This paper contains 25 sections, 16 equations, 3 figures, 2 tables, 1 algorithm.

Introduction
Prevalence Estimation
Notation
Sampling Design and Hansen--Hurwitz Estimator
Weighted Reservoir Sampling
LLM Labeling
Methodology: ML Score Surrogate Prevalence
Cost.
Latency.
Bucket-Level Prevalence
Prevalence Estimation and Variance Propagation
Experimental Results
Experiment A: Target-Category Filtering
Experiment B: UI-Only Change with No Expected Prevalence Shift
Implementation
...and 10 more sections

Figures (3)

Figure 1: Bucket-level impression-share shifts for categories $k_1$ and $k_2$ in Experiment A. The treatment arm reduces exposure primarily by shifting impressions out of higher score buckets.
Figure 2: Comparison of score-bucket distributions under two sampling schemes. Left: sampling with weights proportional to impressions alone produces a sample dominated by low-score items, leaving few examples in high-score buckets. Right: sampling with weights proportional to impressions $\times$ model score yields a more balanced distribution across buckets, improving the precision of bucket-level prevalence estimates.
Figure 3: Relative prevalence reduction vs. calendar day for category $k_1$ in Experiment C. The values are consistently negative over the experiment window, and the day-level aggregation yields a statistically significant $p$-value.
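The sampling scheme contrasted in Figure 2 can be sketched as follows, under stated assumptions. The outline names a Hansen--Hurwitz estimator and weighted reservoir sampling; this sketch substitutes plain with-replacement sampling via `random.choices` for the reservoir step, and every function and field name is hypothetical. Items are drawn with probability proportional to impressions $\times$ model score, and each sampled label is divided by its draw probability so that the impression-weighted prevalence estimate stays unbiased despite the skewed sampling.

```python
import random


def sampling_probs(items):
    """Draw probabilities proportional to impressions * score.

    `items` is a list of dicts with 'impressions' and 'score' keys.
    Scores must be strictly positive: an item with zero probability
    can never be sampled, which would bias the estimate.
    """
    weights = [it["impressions"] * it["score"] for it in items]
    if any(w <= 0 for w in weights):
        raise ValueError("every item needs a positive sampling weight")
    total = sum(weights)
    return [w / total for w in weights]


def draw_sample(items, probs, n, seed=0):
    """Draw n item indices with replacement according to `probs`."""
    rng = random.Random(seed)
    return rng.choices(range(len(items)), weights=probs, k=n)


def hansen_hurwitz_prevalence(items, probs, sampled_idx, labels):
    """Estimate impression-weighted prevalence from a weighted sample.

    `labels` maps an item index to 0 or 1 (for example, an LLM label).
    Each draw contributes impressions * label / probability; the mean
    over draws estimates the total number of positive impressions.
    """
    total_impressions = sum(it["impressions"] for it in items)
    est_positive = sum(
        items[i]["impressions"] * labels[i] / probs[i] for i in sampled_idx
    ) / len(sampled_idx)
    return est_positive / total_impressions
```

Weighting by score as well as impressions places more labeled examples in the sparse high-score buckets, which is the balance shown in the right panel of Figure 2; the division by `probs[i]` removes the resulting over-representation from the final estimate.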
