Near-Exponential Savings for Mean Estimation with Active Learning

Julian M. Morimoto, Jacob Goldin, Daniel E. Ho

TL;DR

This work addresses efficient estimation of the mean $\mu=\mathbb{E}[Y]$ for a $k$-class outcome under a limited label budget $N$ when informative covariates $X$ are available. It introduces PartiBandits, a two-stage active-learning framework that first learns a stratification to reduce within-stratum variance and then samples labels across strata with a WarmStart-UCB subroutine. Theoretical guarantees show near-exponential savings, with $|\hat{\mu}_{\text{PB}}-\mu|^2 = \tilde{O}\left( \frac{\nu + \exp\left(c \cdot (-N/\log N)\right)}{N} \right)$, where $c > 0$ is a constant and $\nu$ is the Bayes-optimal classifier risk, and a complementary rate $|\hat{\mu}_{\text{WS-UCB}}-\mu|^2 = \tilde{O}\left( \frac{\Sigma_1(\mathcal{G})}{N} \right)$ when a stratification $\mathcal{G}$ is available. The approach unifies disagreement-based and UCB-based active-learning philosophies, with minimax-optimal guarantees and empirical validation on large-scale health-record data. The results imply substantial label-efficiency gains in settings where covariates are informative, enabling accurate population mean estimates under tight labeling budgets. The work also provides practical tools, including an R implementation, to deploy PartiBandits in real-world studies.
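To get a feel for the PartiBandits rate, the leading term $(\nu + \exp(-cN/\log N))/N$ can be evaluated numerically. The sketch below is only an illustration: it ignores the logarithmic factors and constants hidden by $\tilde{O}$, sets the unspecified constant to $c = 1$ as an assumption, and compares against $\mathrm{Var}(Y)/N$, the variance of a plain i.i.d. sample mean, with $\mathrm{Var}(Y) = 0.25$ (the largest value for a binary outcome). The function names are hypothetical.

```python
import math

def pb_rate(n, nu, c=1.0):
    """Leading term (nu + exp(-c * n / log n)) / n of the stated bound.

    Ignores log factors and constants; c = 1 is an assumed value.
    """
    return (nu + math.exp(-c * n / math.log(n))) / n

def iid_rate(n, var_y=0.25):
    """Variance of an i.i.d. sample mean, Var(Y) / n."""
    return var_y / n

for n in (10, 50, 100):
    # Columns: budget, noiseless case (nu = 0), nu = 0.05, i.i.d. baseline
    print(n, pb_rate(n, nu=0.0), pb_rate(n, nu=0.05), iid_rate(n))
```

With $\nu = 0$ the term decays almost exponentially in $N$, while for $\nu > 0$ it is soon dominated by $\nu/N$, which is still smaller than $\mathrm{Var}(Y)/N$ whenever $\nu < \mathrm{Var}(Y)$.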

Abstract

We study the problem of efficiently estimating the mean of a $k$-class random variable, $Y$, using a limited number of labels, $N$, in settings where the analyst has access to auxiliary information (i.e., covariates) $X$ that may be informative about $Y$. We propose an active learning algorithm ("PartiBandits") to estimate $\mathbb{E}[Y]$. The algorithm yields an estimate, $\widehat{\mu}_{\text{PB}}$, such that $\left( \widehat{\mu}_{\text{PB}} - \mathbb{E}[Y]\right)^2$ is $\tilde{\mathcal{O}}\left( \frac{\nu + \exp(c \cdot (-N/\log(N))) }{N} \right)$, where $c > 0$ is a constant and $\nu$ is the risk of the Bayes-optimal classifier. PartiBandits is essentially a two-stage algorithm. In the first stage, it learns a partition of the unlabeled data that shrinks the average conditional variance of $Y$. In the second stage, it uses a UCB-style subroutine ("WarmStart-UCB") to request labels from each stratum round by round. Both the main algorithm's and the subroutine's convergence rates are minimax optimal in classical settings. PartiBandits bridges the UCB and disagreement-based approaches to active learning, even though the two approaches were designed to tackle very different tasks. We illustrate our methods through simulation using nationwide electronic health records. Our methods can be implemented using the PartiBandits package in R.
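The second stage can be pictured with a small sketch. The code below is not the authors' WarmStart-UCB and does not use the PartiBandits R package; it is a generic UCB-style stratified sampler written under stated assumptions: the strata and their population weights are already given, each stratum receives a warm start of two labels, and the exploration bonus $\sqrt{\log(N)/n_g}$ is chosen purely for illustration. The function name `ucb_stratified_mean` is hypothetical.

```python
import numpy as np

def ucb_stratified_mean(strata_labels, weights, budget, warm=2, seed=None):
    """Illustrative UCB-style label allocation across fixed strata.

    strata_labels: list of 1-D arrays, the hidden labels of each stratum
                   (queried one at a time, with replacement for simplicity).
    weights:       population share of each stratum (sums to 1).
    Returns the stratified mean estimate and the labels spent per stratum.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    k = len(strata_labels)
    samples = [[] for _ in range(k)]

    # Warm start (assumed): a few labels from every stratum.
    for g in range(k):
        samples[g].extend(rng.choice(strata_labels[g], size=warm))

    for _ in range(warm * k, budget):
        counts = np.array([len(s) for s in samples])
        sds = np.array([np.std(s) for s in samples])
        bonus = np.sqrt(np.log(budget) / counts)
        # Optimistic weighted standard deviation per label already spent:
        # large for strata that are heavy, noisy, or still under-sampled.
        index = weights * (sds + bonus) / counts
        g = int(np.argmax(index))
        samples[g].append(rng.choice(strata_labels[g]))

    means = np.array([np.mean(s) for s in samples])
    return float(weights @ means), [len(s) for s in samples]

# Usage: one noiseless stratum and one noisy stratum of equal weight.
y0 = np.zeros(500)
y1 = np.array([0, 1] * 250)
estimate, spent = ucb_stratified_mean([y0, y1], [0.5, 0.5], budget=100, seed=0)
print(estimate, spent)
```

The design intent is that labels flow toward strata whose outcomes are still uncertain, which is why a partition that shrinks within-stratum variance in the first stage leaves less work for the second.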

Paper Structure

This paper contains 13 sections, 10 theorems, 50 equations, 5 figures, 3 algorithms.

Key Result

Theorem 1

$\left| \widehat{\mu}_{\text{WS-UCB}} - \mathbb{E}[Y] \right|^2 = \tilde{\mathcal{O}}\left( \frac{\Sigma_1(\mathcal{G})}{N} \right)$.

Figures (5)

  • Figure 1: This plot compares the performance of PartiBandits and WarmStart-UCB to simple random sampling (SRS) in different problem settings. The left panel compares SRS to PartiBandits for label budgets from $10$ to $100$. Here, $X \sim \text{Unif}[0,1]$ and $Y = \textbf{1}\left\{ X \geq 0.5\right\}$, with a fixed fraction of $Y$'s (between 0% and 10%) randomly flipped to introduce noise. The proportion of flipped labels is equal to $\nu$ by definition. For each label budget, we generate 500 hypothetical datasets in this way, apply SRS and PartiBandits to each, and compute the resulting error rates. We then take the 90th percentile of these error rates to obtain a classical 90% high-probability/confidence bound. PartiBandits eventually outperforms SRS with relatively fewer samples, with performance gains becoming more pronounced when $X$ better predicts $Y$ and $\nu$ decreases. The right panel compares SRS to WarmStart-UCB for label budgets from $50$ to $200$. In this panel, $X \sim \text{Unif}[0,1]$ and $Y = \textbf{1}\left\{ X \geq 0.5\right\}$, with 5% of the labels randomly flipped to introduce noise. We examine the effect of specifying different stratification schemes beforehand that reduce the within-group variance of $Y$ to varying degrees, where lower values of $\Sigma_1(\mathcal{G})$ indicate better average within-group variance reduction. Each scheme defines strata by applying a threshold between 0.3 and 0.5 and grouping observations based on whether $X$ falls to the left or right of the threshold. We run the same simulation procedure as above to obtain the 90% confidence bounds. WarmStart-UCB consistently outperforms SRS, and the gap grows when stratification reduces variance more effectively (i.e., when $\Sigma_1(\mathcal{G})$ shrinks).
  • Figure 2: Comparison of estimation error for different label budgets using the AFC data.
  • Figure 3: This plot compares the performance of PartiBandits to SRS when the labels are generated according to the following logistic data generating process: $X \sim \mathrm{Unif}[0,1]$ and $Y \sim \mathrm{Bernoulli}\!\left(\frac{1}{1 + \exp[-(\beta_0 + \beta_1 X)]}\right)$, where $\beta_0 = -1/\nu$ and $\beta_1 = 2/\nu$. This corresponds to a Logit-type DGP, with $1/\nu$ governing the steepness of the logistic curve. For each label budget, we generate 500 hypothetical datasets in this way, apply SRS and PartiBandits to each, and compute the resulting error rates. We then take the 90th percentile of these error rates to obtain a classical 90% high-probability/confidence bound. PartiBandits eventually outperforms SRS with relatively fewer samples, with performance gains becoming more pronounced when $X$ better predicts $Y$ and $\nu$ decreases.
  • Figure 4: This plot compares the performance of PartiBandits to SRS when the labels are generated according to the following asymmetric probit data generating process: $X \sim \mathrm{Unif}[-5,5]$ and $Y \sim \mathrm{Bernoulli}\!\left(\Phi\bigl((1/\nu)\,(X - 0.25)\bigr)\right)$, where $\Phi(\cdot)$ denotes the standard normal CDF. This corresponds to a Probit-type DGP, with $1/\nu$ controlling the steepness of the probability curve and $X \approx 0.25$ marking the midpoint threshold. For each label budget, we generate 500 hypothetical datasets in this way, apply SRS and PartiBandits to each, and compute the resulting error rates. We then take the 90th percentile of these error rates to obtain a classical 90% high-probability/confidence bound. PartiBandits eventually outperforms SRS with relatively fewer samples, with performance gains becoming more pronounced when $X$ better predicts $Y$ and $\nu$ decreases.
  • Figure 5: This plot compares the performance of PartiBandits to SRS and Thompson sampling for label budgets from $10$ to $100$. Here, $X \sim \text{Unif}[0,1]$ and $Y = \textbf{1}\left\{ X \geq 0.5\right\}$, with a fixed fraction of $Y$'s (between 0% and 10%) randomly flipped to introduce noise. The proportion of flipped labels is equal to $\nu$ by definition. For each label budget, we generate 500 hypothetical datasets in this way, apply SRS, Thompson sampling, and PartiBandits to each, and compute the resulting error rates. To execute the Thompson sampling, we use the standard Beta-Bernoulli Thompson Sampling algorithm with an uninformative prior $\mathrm{Beta}(1,1)$. At each round, the algorithm samples a success probability from each arm’s posterior, selects the arm with the highest draw, observes a Bernoulli reward, and updates the corresponding posterior. In our setup, we run $T = 3000$ rounds with $K = 3$ arms (true $p = (0.1, 0.5, 0.8)$) for the prototype and $K = 5$ bins over $[0,1]$ with a threshold of $0.5$ for the binned variant. We then take the 90th percentile of these error rates to obtain a classical 90% high-probability/confidence bound. PartiBandits eventually outperforms SRS and Thompson sampling with relatively fewer samples, with performance gains becoming more pronounced when $X$ better predicts $Y$ and $\nu$ decreases. We also observe that, over time, Thompson sampling ceases to yield better mean estimates, consistent with theoretical results suggesting that this procedure can yield biased mean estimates in common settings [shin_are_2019].
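The simulation protocol shared by these captions can be sketched for the SRS baseline and the Thompson sampling prototype. This is a hedged reconstruction rather than the authors' code: it reads SRS as simple random sampling, takes the "error rate" to be the squared error against the population mean, and uses an assumed population size of 10,000. The function names are hypothetical, and PartiBandits itself is not implemented here.

```python
import numpy as np

def noisy_threshold_data(n, flip_frac, rng):
    """X ~ Unif[0,1], Y = 1{X >= 0.5}, with a fixed fraction of labels flipped."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = (x >= 0.5).astype(int)
    flip_idx = rng.choice(n, size=int(round(flip_frac * n)), replace=False)
    y[flip_idx] = 1 - y[flip_idx]
    return x, y

def srs_error_p90(budget, flip_frac, n_pop=10_000, n_rep=500, seed=0):
    """90th percentile of the squared error of the SRS mean over n_rep datasets."""
    rng = np.random.default_rng(seed)
    errs = np.empty(n_rep)
    for r in range(n_rep):
        _, y = noisy_threshold_data(n_pop, flip_frac, rng)
        labelled = rng.choice(y, size=budget, replace=False)  # uniform draw of labels
        errs[r] = (labelled.mean() - y.mean()) ** 2           # assumed error metric
    return float(np.quantile(errs, 0.9))

def thompson_sampling(p_true, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling with a Beta(1,1) prior on every arm."""
    rng = np.random.default_rng(seed)
    k = len(p_true)
    alpha, beta = np.ones(k), np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        draws = rng.beta(alpha, beta)          # one posterior draw per arm
        arm = int(np.argmax(draws))            # play the arm with the highest draw
        reward = int(rng.random() < p_true[arm])
        alpha[arm] += reward                   # posterior update for the played arm
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha / (alpha + beta)

p90_small = srs_error_p90(budget=10, flip_frac=0.05)
p90_large = srs_error_p90(budget=100, flip_frac=0.05)
pulls, post_mean = thompson_sampling((0.1, 0.5, 0.8), n_rounds=3000)
print(p90_small, p90_large, pulls, post_mean)
```

Because Thompson sampling concentrates its pulls on the best arm, the remaining arms receive few observations, which fits the caption's remark that it is a poor match for mean estimation.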

Theorems & Definitions (24)

  • Theorem 1
  • Theorem 2: Lower Bound for WarmStart-UCB
  • Theorem 3
  • Corollary 1: Classical Binary case with low noise
  • Corollary 2: Binary, weaker structural conditions on $\mathcal{C}$
  • Corollary 3: Multiclass
  • Corollary 4: Heterogeneity-Aware $\mathcal{S}$
  • Theorem 4: Lower Bound for PartiBandits
  • Lemma 1
  • proof
  • ...and 14 more