Active Statistical Inference

Tijana Zrnic, Emmanuel J. Candès

TL;DR

Active inference provides a principled framework for inference under labeling budgets by adaptively selecting which data points to label using uncertainty-based predictions from a black-box model. The authors develop batch and sequential variants that yield provably valid confidence intervals and tests for general convex M-estimation targets, with oracle and practical sampling rules to minimize asymptotic variance. The approach consistently achieves substantial sample-efficiency gains across diverse real-world tasks (post-election surveys, census analysis, and AlphaFold-assisted proteomics), reporting budget savings often exceeding 70–80% relative to classical methods and 20–25% relative to non-adaptive PPI baselines, while preserving correct coverage. This work demonstrates that strategic, model-guided data collection can dramatically enhance the power of statistical inference in data-limited settings without sacrificing validity, enabling cost-effective high-stakes analyses in social science and life sciences domains.

Abstract

Inspired by the concept of active learning, we propose active inference, a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.
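
The intuition of labeling where the model is uncertain can be illustrated with a minimal sketch. The function name `sampling_probabilities`, the `floor` parameter, and the proportional-to-uncertainty rule with clipping are illustrative assumptions, not the authors' exact sampling rule; the uncertainty scores are assumed to come from any black-box model.

```python
def sampling_probabilities(uncertainty, budget_fraction, floor=1e-3):
    """Turn per-point uncertainty scores into labeling probabilities.

    Hypothetical helper: probabilities are proportional to uncertainty and
    scaled so that their average is roughly `budget_fraction` (the share of
    points we can afford to label). Clipping to [floor, 1] keeps every
    probability strictly positive, which inverse-probability weighting needs,
    but it can move the average slightly away from the target.
    """
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must lie in (0, 1]")
    if not uncertainty:
        return []
    mean_u = sum(uncertainty) / len(uncertainty)
    if mean_u == 0:
        # No uncertainty signal at all: fall back to uniform sampling.
        return [budget_fraction] * len(uncertainty)
    return [
        min(1.0, max(floor, budget_fraction * u / mean_u))
        for u in uncertainty
    ]
```

Each point would then be labeled independently with its own probability, so confident predictions are rarely queried while uncertain ones are queried often.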

Paper Structure (31 sections, 4 theorems, 50 equations, 8 figures, 2 algorithms)

Key Result

Proposition 5.1

Suppose that there exists $\eta^*\in\mathcal{H}$ such that $\mathbb{P}(\hat{\eta} \neq \eta^*)\to 0$. Then $\sqrt{n}\,(\hat{\theta}^{\hat{\eta}} - \theta^*) \stackrel{d}{\to} \mathcal{N}(0, \sigma_*^2)$, where $\sigma_*^2 = \mathrm{Var}\left(f(X) + (Y-f(X))\frac{\xi^{\eta^*}}{\pi_{\eta^*}(X)}\right)$ and $\xi^{\eta^*}\sim\mathrm{Bern}(\pi_{\eta^*}(X))$. Consequently, for any $\hat{\sigma}^2\stackrel{p}{\to}\sigma_*^2$, $\mathcal{C}_{\alpha} = \left(\hat{\theta}^{\hat{\eta}} \pm z_{1-\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}}\right)$ …
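
The quantity inside the variance above, $f(X) + (Y-f(X))\,\xi/\pi(X)$, suggests how a mean estimate and a normal-approximation interval could be computed. The following is a sketch under stated assumptions: the function name `active_mean_ci` and its argument names are hypothetical, and using the sample variance of the per-point terms as $\hat{\sigma}^2$ is one natural choice rather than necessarily the authors' estimator.

```python
import math
from statistics import NormalDist, fmean, variance


def active_mean_ci(preds, labels, probs, sampled, alpha=0.1):
    """Estimate a mean label from predictions plus actively sampled labels.

    preds   : model predictions f(X_i) for all n points
    labels  : true labels Y_i (may be None wherever sampled[i] is False)
    probs   : labeling probabilities pi(X_i), each in (0, 1]
    sampled : booleans xi_i, True if the label of point i was collected
    Returns (estimate, lower, upper) for a (1 - alpha) interval.
    """
    n = len(preds)
    if n < 2:
        raise ValueError("need at least two points to estimate a variance")
    terms = []
    for f, y, p, xi in zip(preds, labels, probs, sampled):
        # Inverse-probability-weighted correction of the prediction;
        # it is zero whenever the label was not collected.
        correction = (y - f) / p if xi else 0.0
        terms.append(f + correction)
    theta_hat = fmean(terms)
    # Assumption: sample variance of the terms serves as sigma-hat squared.
    sigma_hat = math.sqrt(variance(terms))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma_hat / math.sqrt(n)
    return theta_hat, theta_hat - half_width, theta_hat + half_width
```

When the predictions are accurate, the correction terms are small and the interval is narrow; when every point is labeled with probability one, the estimate reduces to the ordinary sample mean of the labels.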

Figures (8)

  • Figure 1: Post-election survey research. Example intervals in five randomly chosen trials (left), average confidence interval width (middle), and coverage (right) for the average approval of Joe Biden's (top) and Donald Trump's (bottom) political messaging to the country following the 2020 US presidential election.
  • Figure 2: Census data analysis. Example intervals in five randomly chosen trials (left), average confidence interval width (middle), and coverage (right) for the linear regression coefficient quantifying the relationship between age and income, controlling for sex, in US Census data.
  • Figure 3: AlphaFold-assisted proteomics research. Example intervals in five randomly chosen trials (left), average confidence interval width (middle), and coverage (right) for the odds ratio between phosphorylation and being part of an IDR.
  • Figure 4: Savings in sample budget due to active inference. Reduction in sample size required to achieve the same confidence interval width with active inference, relative to (top) classical inference and (bottom) uniform sampling, across the applications shown in Figures 1-3.
  • Figure 5: Post-election survey research with fine-tuning. Example intervals in five randomly chosen trials (left), average confidence interval width (middle), and coverage (right) for the average approval of Joe Biden's (top) and Donald Trump's (bottom) political messaging to the country following the 2020 US presidential election. Active inference with no fine-tuning and inference with uniformly sampled data use the same model.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Example 1: Mean label
  • Example 2: Linear regression
  • Example 3: Label quantile
  • Proposition 5.1
  • Theorem 5.1: CLT for batch active inference
  • Proposition 6.1
  • Theorem 6.1: CLT for sequential active inference