
Cross-Prediction-Powered Inference

Tijana Zrnic, Emmanuel J. Candès

TL;DR

Cross-prediction provides a principled semi-supervised inference framework that leverages large unlabeled datasets by imputing labels with machine learning and debiasing to achieve valid confidence statements. By employing cross-fitting over multiple models, it yields unbiased estimators with reduced variance when predictions are reasonably accurate, and it extends to general M-estimation with CLT-based inference. The approach demonstrates superior power and stability compared with prediction-powered inference and classical labeled-data inference across synthetic and real datasets, including deforestation from satellite imagery, ACS survey data, and Galaxy Zoo imagery. Practically, this method enables more efficient use of unlabeled data to obtain reliable inferences in domains with expensive labeling or measurements. The combination of cross-prediction and bootstrap-based variance estimation supports robust, scalable confidence intervals for a wide range of estimands, from means to quantiles and GLMs.

Abstract

While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.
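The impute-then-debias idea described above can be sketched for the simplest target, the mean outcome. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cross_prediction_mean`, the least-squares predictor, the number of folds, and the synthetic data are all hypothetical choices made for this example. Each fold's model is trained on the other folds, the fold-averaged predictions impute labels on the unlabeled data, and the held-out prediction errors on the labeled data are subtracted as a bias correction.

```python
import numpy as np


def cross_prediction_mean(X_lab, y_lab, X_unlab, n_folds=5, seed=0):
    """Cross-fitted, debiased estimate of E[Y] (illustrative sketch).

    Assumption: a plain least-squares model stands in for the
    machine-learning predictor; any regressor could be swapped in.
    """
    rng = np.random.default_rng(seed)
    n, N = len(y_lab), len(X_unlab)
    folds = np.array_split(rng.permutation(n), n_folds)

    # Add an intercept column to both design matrices.
    A_lab = np.column_stack([np.ones(n), X_lab])
    A_unlab = np.column_stack([np.ones(N), X_unlab])

    avg_unlab_pred = np.zeros(N)  # predictions averaged over the fold models
    held_out_bias = 0.0           # sum of (prediction - label) on held-out points

    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Fit on every labeled point outside the current fold.
        coef, *_ = np.linalg.lstsq(A_lab[train], y_lab[train], rcond=None)
        avg_unlab_pred += (A_unlab @ coef) / n_folds
        # Measure prediction error only on points this model never saw.
        held_out_bias += np.sum(A_lab[held_out] @ coef - y_lab[held_out])

    # Imputed mean on unlabeled data, minus the estimated prediction bias.
    return avg_unlab_pred.mean() - held_out_bias / n


# Synthetic check (hypothetical data): the true mean of Y is 2.0.
rng = np.random.default_rng(1)
beta = np.array([1.5, -1.0, 0.5])
X_lab = rng.normal(size=(200, 3))
y_lab = 2.0 + X_lab @ beta + rng.normal(size=200)
X_unlab = rng.normal(size=(20000, 3))

estimate = cross_prediction_mean(X_lab, y_lab, X_unlab)
print(f"cross-prediction estimate: {estimate:.3f}")
print(f"labeled-only sample mean:  {y_lab.mean():.3f}")
```

Because every labeled point is scored by a model that was not trained on it, the correction term estimates the prediction bias without reusing training data. The sketch returns only a point estimate; building a confidence interval additionally requires a variance estimate, which is not shown here.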

Paper Structure

This paper contains 26 sections, 8 theorems, 69 equations, 10 figures, 4 tables.

Key Result

Theorem 4.1

Let $\theta^*$ be the mean outcome, $\theta^* = \mathbb{E}[Y]$. Suppose that the predictions satisfy the stability assumption. Further, assume that $\frac{n}{N}$ has a limit, and that $\bar{\sigma}^2 = \mathrm{Var}(\bar{f}(X))$ and $\bar{\sigma}_\Delta^2 = \mathrm{Var}(\bar{f}(X) - Y)$ have a nonzero

Figures (10)

  • Figure 1: Examples of GEE satellite imagery used in the deforestation analysis of Bullock et al. (2020).
  • Figure 2: Estimating the deforestation rate in the Amazon from satellite imagery. Left: Example intervals constructed by cross-prediction, classical inference, and prediction-powered inference (PPI), for five random splits into labeled and unlabeled data and a fixed number of gold-standard deforestation labels, $n=319$. Middle and right: Coverage and interval width averaged over $100$ random splits into labeled and unlabeled data, for $n\in\{319, 638, 957\}$. The target of inference is the fraction of the Amazon rainforest lost between 2000 and 2015 (gray line in left panel). The target coverage is $90\%$ (gray line in middle panel).
  • Figure 3: Mean estimation. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a mean estimation problem.
  • Figure 4: Quantile estimation. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a quantile estimation problem. The target is the 75th percentile.
  • Figure 5: Linear regression. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a linear regression problem.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Theorem 4.1: Cross-prediction CLT for the mean
  • Corollary 4.1: Inference for the mean via cross-prediction
  • Theorem 5.1: Cross-prediction CLT
  • Corollary 5.1: Inference via cross-prediction
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • Lemma A.4