Near-Exponential Savings for Mean Estimation with Active Learning

Julian M. Morimoto; Jacob Goldin; Daniel E. Ho

TL;DR

This work addresses efficient estimation of the mean $μ=\mathbb{E}[Y]$ of a $k$-class outcome $Y$ under a limited label budget $N$ when informative covariates $X$ are available. It introduces PartiBandits, a two-stage active-learning framework that first learns a stratification that reduces the average within-stratum variance of $Y$ and then samples labels across strata with a WarmStart-UCB subroutine. Theoretical guarantees show near-exponential savings, with $|\hat{μ}_{\text{PB}}-μ|^2 = \tilde{O}\left( \frac{ν + \exp\left(c \cdot (-N/\log N)\right)}{N} \right)$, where $c > 0$ is a constant and $ν$ is the risk of the Bayes-optimal classifier, and a complementary rate $|\hat{μ}_{\text{WS-UCB}}-μ|^2 = \tilde{O}\left( \frac{\Sigma_1(\mathcal{G})}{N} \right)$ when a stratification $\mathcal{G}$ is already available. The approach bridges the disagreement-based and UCB-based approaches to active learning, and both rates are minimax optimal in classical settings. The methods are illustrated through simulation on nationwide electronic health records. The results imply substantial label-efficiency gains when covariates are informative, enabling accurate population mean estimates under tight labeling budgets. An R package, PartiBandits, implements the methods.
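To see why a variance-reducing stratification can help, it may be useful to recall two textbook identities from stratified sampling. They are standard facts, not results taken from this work, and the notation ($G$ for a unit's stratum, $w_g$, $σ_g^2$, $n_g$, $\bar{Y}_g$) is introduced here only for illustration. The law of total variance splits the variance of $Y$ into a within-stratum and a between-stratum part:

$$\operatorname{Var}(Y) = \mathbb{E}\left[\operatorname{Var}(Y \mid G)\right] + \operatorname{Var}\left(\mathbb{E}[Y \mid G]\right).$$

If the stratum weights $w_g = \Pr(G = g)$ are known (for example, from the unlabeled data) and $n_g$ labels are drawn independently within stratum $g$, the stratified estimator and its variance are

$$\hat{μ}_{\text{strat}} = \sum_{g} w_g \bar{Y}_g, \qquad \operatorname{Var}\left(\hat{μ}_{\text{strat}}\right) = \sum_{g} \frac{w_g^2 σ_g^2}{n_g}, \qquad σ_g^2 = \operatorname{Var}(Y \mid G = g).$$

Only the within-stratum variances $σ_g^2$ enter the second expression, so a partition under which $Y$ is nearly constant within each stratum leaves little variance for the labels to average out.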

Abstract

We study the problem of efficiently estimating the mean of a $k$-class random variable, $Y$, using a limited number of labels, $N$, in settings where the analyst has access to auxiliary information (i.e., covariates) $X$ that may be informative about $Y$. We propose an active learning algorithm ("PartiBandits") to estimate $\mathbb{E}[Y]$. The algorithm yields an estimate, $\widehat{μ}_{\text{PB}}$, such that $\left( \widehat{μ}_{\text{PB}} - \mathbb{E}[Y]\right)^2$ is $\tilde{\mathcal{O}}\left( \frac{ν + \exp(c \cdot (-N/\log(N))) }{N} \right)$, where $c > 0$ is a constant and $ν$ is the risk of the Bayes-optimal classifier. PartiBandits is essentially a two-stage algorithm. In the first stage, it learns a partition of the unlabeled data that shrinks the average conditional variance of $Y$. In the second stage, it uses a UCB-style subroutine ("WarmStart-UCB") to request labels from each stratum round by round. The convergence rates of both the main algorithm and the subroutine are minimax optimal in classical settings. PartiBandits bridges the UCB and disagreement-based approaches to active learning, even though the two approaches were designed for very different tasks. We illustrate our methods through simulation using nationwide electronic health records. Our methods can be implemented using the PartiBandits package in R.
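The second stage described above can be illustrated with a minimal Python sketch. It is not the paper's WarmStart-UCB and not the API of the PartiBandits R package: it assumes the strata are already given (stage one is skipped), and the function name `ucb_stratified_mean`, the parameters `warm_start` and `bonus`, and the specific index rule are hypothetical choices made for this example. Each round, it requests one label from the stratum whose weighted, optimistically inflated standard deviation per label is largest.

```python
import numpy as np

def ucb_stratified_mean(pools, weights, budget, warm_start=2, bonus=1.0, seed=0):
    """Generic UCB-style label allocation across fixed strata (illustrative only).

    pools:   list of 1-D arrays; pools[g] holds the hidden labels of stratum g
             and plays the role of the labeling oracle.
    weights: stratum population shares, assumed known and summing to 1.
    budget:  total number of labels that may be requested.
    """
    weights = np.asarray(weights, dtype=float)
    k = len(pools)
    if warm_start < 1 or budget < k * warm_start:
        raise ValueError("budget must cover warm_start >= 1 labels per stratum")
    rng = np.random.default_rng(seed)
    draws = [[] for _ in range(k)]

    def query(g):
        # Request one label from stratum g (sampling with replacement).
        draws[g].append(rng.choice(pools[g]))

    # Warm start: a few labels per stratum so every variance estimate exists.
    for g in range(k):
        for _ in range(warm_start):
            query(g)

    # Adaptive rounds: one label per round, sent to the highest-index stratum.
    for _ in range(budget - k * warm_start):
        n = np.array([len(d) for d in draws])
        sd = np.array([np.std(d) for d in draws])
        # Optimistic (upper-confidence) standard deviation, scaled by w_g / n_g.
        index = weights / n * (sd + bonus / np.sqrt(n))
        query(int(np.argmax(index)))

    counts = np.array([len(d) for d in draws])
    means = np.array([np.mean(d) for d in draws])
    return float(weights @ means), counts

# Three equally weighted binary strata; only the middle one is noisy.
pools = [
    np.r_[np.ones(20), np.zeros(980)],
    np.r_[np.ones(500), np.zeros(500)],
    np.r_[np.ones(980), np.zeros(20)],
]
estimate, counts = ucb_stratified_mean(pools, [1 / 3, 1 / 3, 1 / 3], budget=300)
print(estimate, counts)
```

In this toy setup the two nearly pure strata need few labels, so the rule concentrates the budget on the noisy middle stratum. The confidence bonus keeps a stratum from being abandoned just because its first few labels happened to agree.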
