Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Michael O. Harding; Vikas Singh; Kirthevasan Kandasamy

TL;DR

The paper addresses budgeted, multi-source data collection when source populations are heterogeneous. It introduces the effective sample size, defined through the $\chi^2$-divergence between the target distribution and the aggregated source mixture, and shows that maximizing this quantity under the budget yields minimax-optimal risk for population mean and group-conditional mean estimation when paired with a post-stratified estimator. It further extends the framework to prediction via importance-weighted ERM. The upper and lower bounds match up to lower-order terms, and simulations demonstrate practical gains. The result is a principled, cost-aware data collection strategy applicable to medical studies, polling, and other domains where data come from diverse, costly sources.
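As a rough illustration of post-stratification, the sketch below reweights per-group sample means by the target group proportions. It is a minimal sketch of the classical estimator, not necessarily the exact form analyzed in the paper: the function name `post_stratified_mean`, the integer group encoding, and the choice to raise an error on an empty group are all assumptions made for this example.

```python
import numpy as np

def post_stratified_mean(y, groups, q):
    """Estimate the target mean as sum_k q[k] * (sample mean of group k).

    y      : observed outcomes, pooled over all sources
    groups : integer group label in {0, ..., K-1} for each observation
    q      : target group proportions (length K, assumed to sum to 1)
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    estimate = 0.0
    for k, q_k in enumerate(q):
        if q_k == 0:
            continue  # groups absent from the target contribute nothing
        in_group = groups == k
        if not in_group.any():
            # Assumption for this sketch: fail loudly on an unobserved group.
            raise ValueError(f"no observations for group {k}")
        estimate += q_k * y[in_group].mean()
    return estimate

# Toy data: group 0 is over-represented in the sample relative to the target.
y = [1.0, 1.0, 1.0, 3.0]
groups = [0, 0, 0, 1]
q = [0.5, 0.5]

naive = float(np.mean(y))                       # plain sample mean
adjusted = post_stratified_mean(y, groups, q)   # reweighted by q
```

On this toy data the plain sample mean is pulled toward the over-sampled group, while the post-stratified estimate weights both groups equally, as the target proportions require.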

Abstract

Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g., attempting to "match" the target distribution) or standard estimators (e.g., the sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{\chi^2}(q \,\|\, \overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
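The effective sample size described above can be computed directly from a sampling plan. The sketch below is illustrative only and rests on stated assumptions: the aggregated source distribution is taken to be the sample-count-weighted mixture of the per-source group distributions, the identity $D_{\chi^2}(q \,\|\, \overline{p}) + 1 = \sum_k q_k^2 / \overline{p}_k$ is used, and the exhaustive search in `best_plan` is a stand-in for whatever optimization procedure the paper uses. All function and variable names are hypothetical.

```python
from itertools import product
import numpy as np

def effective_sample_size(n, P, q):
    """Total sample size divided by (chi^2-divergence(q || p_bar) + 1).

    n : samples drawn from each source (length S)
    P : S x K matrix, row i is the group distribution of source i
    q : target group distribution (length K)
    """
    n = np.asarray(n, dtype=float)
    P = np.asarray(P, dtype=float)
    q = np.asarray(q, dtype=float)
    total = n.sum()
    if total == 0:
        return 0.0
    # Assumption: p_bar is the mixture of sources weighted by sample counts.
    p_bar = n @ P / total
    support = q > 0
    if np.any(p_bar[support] == 0):
        return 0.0  # divergence is infinite when a target group is never sampled
    # D_chi2(q || p_bar) + 1 equals sum_k q_k^2 / p_bar_k.
    return total / np.sum(q[support] ** 2 / p_bar[support])

def best_plan(P, q, costs, budget):
    """Brute-force search over integer plans with costs @ n <= budget."""
    ranges = [range(int(budget // c) + 1) for c in costs]
    best, best_val = None, -1.0
    for n in product(*ranges):
        if np.dot(costs, n) <= budget:
            val = effective_sample_size(n, P, q)
            if val > best_val:
                best, best_val = n, val
    return best, best_val

# Toy instance: two sources with mirrored group compositions, equal costs.
P = [[0.9, 0.1], [0.1, 0.9]]
q = [0.5, 0.5]
costs = [1.0, 1.0]
budget = 10.0

one_source = effective_sample_size([10, 0], P, q)
plan, value = best_plan(P, q, costs, budget)
```

In this symmetric toy instance, spending the whole budget on a single source yields an effective sample size well below the ten samples collected, whereas splitting evenly makes the aggregated distribution equal to the target and loses nothing. The brute-force search is only practical for very small budgets and few sources.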

Paper Structure (31 sections, 25 theorems, 101 equations, 3 figures, 1 table)

Introduction
Multi-source data collection
Model
Environment
Learning with respect to a target group distribution
Multi-source data collection under a budget
Summary of our contributions and techniques
Effective sample size
Lower bound (§\ref{sec:me-lbs})
Upper bound (§\ref{sec:mean-est})
Prediction problems (§\ref{sec:prediction})
Empirical evaluation (App. \ref{sec:experiments})
Related Work
Sampling techniques
Effective sample size
...and 16 more sections

Key Result

Theorem 1

The minimax risk \eqref{eqn:mmriskpm} satisfies the following lower bound, where $\bm{n}^\star_T$ is the sampling plan $\bm{n}$ which maximizes $n_{\rm{eff}}(\bm{n}, q)$ subject to the constraint $\bm{c}^\top \bm{n} \leq B$. We have,

Figures (3)

Figure 1: Estimated risk based on 100 simulations in each setting. Error regions represent empirical average $\pm$ 2 SE. Row 1: Population mean under $u_K$. Row 2: Vector of group means. Row 3: Binary classification under $u_K$.
Figure 2: Estimated risk based on 100 simulations in each setting. Error regions represent empirical average $\pm$ 2 SE. Row 1: Population mean. Row 2: Binary classification.
Figure 3: Estimated risk based on 100 simulations in each setting. Error regions represent empirical average $\pm$ 2 SE. Row 1: Population mean. Row 2: Binary classification.

Theorems & Definitions (44)

Example
Theorem 1: Informal
Theorem 2: Informal
Theorem 3: Informal
Theorem 4: Informal
Theorem 4
proof
Lemma 1
Lemma 2
Theorem 5
...and 34 more
