Measuring the Prevalence of Policy Violating Content with ML Assisted Sampling and LLM Labeling
Attila Dobi, Aravindh Manickavasagam, Benjamin Thompson, Xiaohan Yang, Faisal Farooq
TL;DR
This paper addresses the challenge of measuring the prevalence of policy-violating content in user impressions, a metric that complements reports and enables timely interventions. It proposes a design-based daily measurement framework that combines ML-assisted probability sampling from impression logs with LLM-assisted labeling, producing design-consistent daily prevalence estimates with confidence intervals and flexible drill-downs. Key contributions include a design-consistent estimator enabling single-sample drill-downs, practical uncertainty quantification (CIs and ESS) with optional label-error correction, and a configurable workflow that ties policy definitions, SME prompts, and gold sets to a production-ready daily pipeline. Empirically, the approach yields substantial sampling-efficiency gains, supports multi-pivot analysis from a single sample, and scales to more than 1M items per day, enabling rapid, governance-grade monitoring and efficient experimentation for platform safety teams.
Abstract
Content safety teams need metrics that reflect what users actually experience, not only what is reported. We study prevalence: the fraction of user views (impressions) that went to content violating a given policy on a given day. Accurate prevalence measurement is challenging because violations are often rare and human labeling is costly, making frequent, platform-representative studies slow. We present a design-based measurement system that (i) draws daily probability samples from the impression stream using ML-assisted weights to concentrate label budget on high-exposure and high-risk content while preserving unbiasedness, (ii) labels sampled items with a multimodal LLM governed by policy prompts and gold-set validation, and (iii) produces design-consistent prevalence estimates with confidence intervals and dashboard drill-downs. A key design goal is one global sample with many pivots: the same daily sample supports prevalence by surface, viewer geography, content age, and other segments through post-stratified estimation. We describe the statistical estimators, variance and confidence interval construction, label-quality monitoring, and an engineering workflow that makes the system configurable across policies.
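To make the estimation idea concrete, the following is a minimal sketch of an inverse-probability-weighted prevalence estimate computed from one labeled sample, with a pivot by segment. It is not the paper's implementation. It assumes a Hájek-style ratio estimator, a with-replacement linearization for the variance, a normal-approximation interval, and Kish's formula for effective sample size (ESS). The names `LabeledImpression`, `inclusion_prob`, `is_violating`, `segment`, and `weighted_prevalence` are hypothetical, and the pivot here is plain subset estimation rather than the post-stratified estimator the abstract refers to.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LabeledImpression:
    # All field names are hypothetical, chosen for illustration only.
    inclusion_prob: float  # probability the sampler drew this impression
    is_violating: bool     # label assigned to the sampled item
    segment: str           # one pivot dimension, e.g. a surface name


def weighted_prevalence(
    sample: Iterable[LabeledImpression],
    segment: Optional[str] = None,
    z: float = 1.96,
) -> Dict[str, float]:
    """Ratio (Hajek-style) prevalence estimate with an approximate CI and Kish ESS."""
    rows = [r for r in sample if segment is None or r.segment == segment]
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two sampled impressions")

    # Design weight: the inverse of each impression's inclusion probability.
    weights = [1.0 / r.inclusion_prob for r in rows]
    total_weight = sum(weights)

    # Weighted share of violating impressions.
    p_hat = sum(w for w, r in zip(weights, rows) if r.is_violating) / total_weight

    # Linearized variance of the ratio, treating draws as with-replacement
    # (an assumption of this sketch, and only an approximation for subsets).
    residuals = [w * (float(r.is_violating) - p_hat) for w, r in zip(weights, rows)]
    variance = (n / (n - 1)) * sum(e * e for e in residuals) / total_weight ** 2
    half_width = z * math.sqrt(variance)

    # Kish effective sample size: how many equal-weight draws the sample is worth.
    ess = total_weight ** 2 / sum(w * w for w in weights)

    return {
        "prevalence": p_hat,
        "ci_low": max(0.0, p_hat - half_width),
        "ci_high": min(1.0, p_hat + half_width),
        "ess": ess,
    }


# Usage: one sample, several pivots.
sample = [
    LabeledImpression(0.5, True, "feed"),
    LabeledImpression(0.5, False, "feed"),
    LabeledImpression(0.1, False, "search"),
    LabeledImpression(0.1, False, "search"),
]
overall = weighted_prevalence(sample)
by_feed = weighted_prevalence(sample, segment="feed")
```

Because every row carries its own design weight, the same labeled sample can be re-aggregated along any pivot without drawing a new sample. Highly unequal weights lower the ESS and widen the interval, which is why the two are reported together.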
