Measuring the Prevalence of Policy Violating Content with ML Assisted Sampling and LLM Labeling

Attila Dobi; Aravindh Manickavasagam; Benjamin Thompson; Xiaohan Yang; Faisal Farooq

TL;DR

This paper addresses the challenge of measuring the prevalence of policy-violating content in user impressions, a metric that complements reports and enables timely interventions. It proposes a design-based daily measurement framework that combines ML-assisted probability sampling from impression logs with LLM-assisted labeling, producing design-consistent daily prevalence estimates with confidence intervals and flexible drill-downs. Key contributions include a design-consistent estimator enabling single-sample drill-downs, practical uncertainty quantification (CIs and ESS) with optional label-error correction, and a configurable workflow that ties policy definitions, SME prompts, and gold sets to a production-ready daily pipeline. Empirically, the approach yields substantial sampling-efficiency gains, supports multi-pivot analysis from a single sample, and scales to more than 1M items per day, enabling rapid, governance-grade monitoring and efficient experimentation for platform safety teams.

Abstract

Content safety teams need metrics that reflect what users actually experience, not only what is reported. We study prevalence: the fraction of user views (impressions) that went to content violating a given policy on a given day. Accurate prevalence measurement is challenging because violations are often rare and human labeling is costly, making frequent, platform-representative studies slow. We present a design-based measurement system that (i) draws daily probability samples from the impression stream using ML-assisted weights to concentrate label budget on high-exposure and high-risk content while preserving unbiasedness, (ii) labels sampled items with a multimodal LLM governed by policy prompts and gold-set validation, and (iii) produces design-consistent prevalence estimates with confidence intervals and dashboard drill-downs. A key design goal is one global sample with many pivots: the same daily sample supports prevalence by surface, viewer geography, content age, and other segments through post-stratified estimation. We describe the statistical estimators, variance and confidence interval construction, label-quality monitoring, and an engineering workflow that makes the system configurable across policies.
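To make the estimation step in the abstract concrete, the following is a minimal sketch of a Hansen-Hurwitz style ratio estimator for impression-weighted prevalence under probability-proportional-to-size sampling with replacement. It is an illustration under stated assumptions, not the paper's implementation: the function name `hh_ratio_prevalence`, the linearization-based variance, the Kish effective sample size, the synthetic population, and the `is_new` segment are all hypothetical choices made for this example.

```python
import numpy as np


def hh_ratio_prevalence(impressions, probs, labels, segment_mask=None, z=1.96):
    """Ratio estimate of impression-weighted prevalence from PPS draws.

    All inputs are per-draw arrays for the m draws made with replacement
    (an item drawn twice appears twice). `probs` holds each drawn item's
    single-draw selection probability. `segment_mask` (optional) restricts
    the estimate to a segment while reusing the same sample.
    """
    imp = np.asarray(impressions, dtype=float)
    p = np.asarray(probs, dtype=float)
    y = np.asarray(labels, dtype=float)
    d = np.ones_like(y) if segment_mask is None else np.asarray(segment_mask, dtype=float)
    m = len(y)

    # Expanded impressions per draw; zero for draws outside the segment.
    a = d * imp / p
    prev = (a * y).sum() / a.sum()

    # Linearized (Taylor) variance of the ratio; residuals sum to zero.
    resid = a * (y - prev)
    var = resid.var(ddof=1) / (m * a.mean() ** 2)
    half = z * np.sqrt(var)

    # Kish effective sample size of the expansion weights.
    ess = a.sum() ** 2 / (a ** 2).sum()
    return prev, (max(prev - half, 0.0), min(prev + half, 1.0)), ess


# Synthetic illustration (assumed data, not from the paper).
rng = np.random.default_rng(0)
n_items, m = 10_000, 2_000
impressions = rng.lognormal(3.0, 1.0, n_items)
labels = (rng.random(n_items) < 0.02).astype(float)
# Hypothetical risk score: informative but noisy, floored so every item
# keeps a positive selection probability.
score = np.clip(0.05 + 0.5 * labels + rng.normal(0, 0.05, n_items), 0.01, 1.0)
probs = impressions * score / (impressions * score).sum()
draws = rng.choice(n_items, size=m, replace=True, p=probs)
is_new = rng.random(n_items) < 0.3  # hypothetical "content age" segment

overall = hh_ratio_prevalence(impressions[draws], probs[draws], labels[draws])
segment = hh_ratio_prevalence(
    impressions[draws], probs[draws], labels[draws], segment_mask=is_new[draws]
)
truth = (impressions * labels).sum() / impressions.sum()
print("true prevalence:", truth)
print("overall estimate, CI, ESS:", overall)
print("segment estimate, CI, ESS:", segment)
```

The segment call reuses the same draws and only changes the mask, which is the sense in which one global sample can serve many pivots. The sketch does not guard against a segment with no sampled draws, where the denominator would be zero.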

Paper Structure (38 sections, 20 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Problem and Metric
Segments and drill-downs
ML-Assisted Probability Sampling
Sampling weights
Efficient implementation via weighted reservoir sampling
Design-Consistent Estimation and Single-Sample Drill-downs
Hansen-Hurwitz ratio estimator (PPS with replacement)
Without-replacement alternative
Drill-downs from a single sample
Uncertainty Quantification
Effective sample size diagnostics
Optional correction for label error
...and 23 more sections

Figures (4)

Figure 1: Illustrative example: prevalence trend over time (percent of impressions) with intervention markers. In production, estimates include 95% confidence intervals.
Figure 2: Example LLM outputs: structured label, rationale, and confidence. (Content blurred.)
Figure 3: Prevalence workflow: impression logs and auxiliary safety scores drive sampling; an LLM labeler produces policy labels; the estimator generates prevalence and CIs; results and lineage are stored for dashboards and audits.
Figure 4: Simulation POC: empirical 95% CI width vs. sample size $m$ for two sampling schemes: PPS ($w=\text{impressions}$) and ML_PPS ($w=\text{impressions} \times \text{model\_score}$).
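
The comparison described in the Figure 4 caption can be approximated with a small simulation. The sketch below is not the authors' simulation setup: the synthetic population, the score model, the 1% base rate, and the function names `ci_width` and `simulate` are assumptions made for illustration. It draws with replacement under each weighting scheme, computes a linearized 95% CI for the ratio estimate, and averages the CI width over repetitions.

```python
import numpy as np


def ci_width(weights, impressions, labels, m, rng, z=1.96):
    """Width of one linearized CI from m with-replacement draws with p ∝ weights."""
    p = weights / weights.sum()
    idx = rng.choice(len(p), size=m, replace=True, p=p)
    a = impressions[idx] / p[idx]           # expanded impressions per draw
    y = labels[idx]
    prev = (a * y).sum() / a.sum()          # ratio estimate of prevalence
    resid = a * (y - prev)
    se = np.sqrt(resid.var(ddof=1) / m) / a.mean()
    return 2 * z * se


def simulate(n_items=50_000, sizes=(500, 2_000, 8_000), reps=50, seed=0):
    """Mean CI width per sample size for PPS and ML_PPS on synthetic data."""
    rng = np.random.default_rng(seed)
    impressions = rng.lognormal(mean=3.0, sigma=1.5, size=n_items)
    labels = (rng.random(n_items) < 0.01).astype(float)
    # Assumed score model: noisy but informative, floored so that every
    # item keeps a positive selection probability.
    score = np.clip(0.02 + 0.6 * labels + rng.normal(0, 0.05, n_items), 0.01, 1.0)
    schemes = {"PPS": impressions, "ML_PPS": impressions * score}
    return {
        name: [
            float(np.mean([ci_width(w, impressions, labels, m, rng) for _ in range(reps)]))
            for m in sizes
        ]
        for name, w in schemes.items()
    }


if __name__ == "__main__":
    for name, widths in simulate().items():
        print(name, widths)
```

How much narrower the ML_PPS intervals are depends entirely on how informative the assumed score is. With an uninformative score, the extra weight variation would be expected to widen the intervals instead.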
