Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy
TL;DR
The paper addresses budgeted, multi-source data collection with heterogeneous source populations. It introduces the effective sample size, grounded in the $χ^2$-divergence between the target distribution and the aggregated source distribution, and shows that maximizing this quantity yields minimax-optimal risk for population mean and group-conditional mean estimation when paired with a post-stratified estimator. It further extends the framework to prediction via importance-weighted ERM. The results establish matching lower and upper bounds up to lower-order terms and demonstrate practical gains through simulations. This provides a principled, cost-aware data collection strategy applicable to medical studies, polling, and other domains where data come from diverse, costly sources.
Abstract
Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g., attempting to "match" the target distribution) and standard estimators (e.g., the sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{χ^2}(q \,\|\, \overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{χ^2}$ is the $χ^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
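To make the effective sample size concrete, the following is a minimal Python sketch under stated assumptions rather than the paper's implementation. It assumes discrete group distributions, uses the standard identity $D_{χ^2}(q \,\|\, p) = \sum_g q_g^2 / p_g - 1$, and assumes that the aggregated source distribution $\overline{p}$ is the sample-count-weighted mixture of the source distributions (the abstract does not spell out the aggregation). All function names are hypothetical, and the budget and cost optimization itself is not shown.

```python
def chi2_divergence(q, p):
    """Chi-squared divergence D(q || p) = sum_g q_g^2 / p_g - 1 for discrete distributions."""
    # The divergence is infinite if the target puts mass on a group the sources never cover.
    if any(qg > 0 and pg == 0 for qg, pg in zip(q, p)):
        return float("inf")
    return sum(qg * qg / pg for qg, pg in zip(q, p) if pg > 0) - 1.0


def aggregate_sources(source_dists, counts):
    """Assumption: p_bar is the mixture of source distributions weighted by sample counts."""
    n = sum(counts)
    num_groups = len(source_dists[0])
    return [
        sum(c * p[g] for p, c in zip(source_dists, counts)) / n
        for g in range(num_groups)
    ]


def effective_sample_size(q, source_dists, counts):
    """Total sample size divided by D_chi2(q || p_bar) + 1."""
    n = sum(counts)
    p_bar = aggregate_sources(source_dists, counts)
    return n / (chi2_divergence(q, p_bar) + 1.0)


def post_stratified_mean(q, group_samples):
    """Classical post-stratification: weight each group's sample mean by its target share."""
    return sum(qg * sum(xs) / len(xs) for qg, xs in zip(q, group_samples))


# Illustrative numbers only (not from the paper): two groups, two sources.
q = [0.5, 0.5]
sources = [[0.8, 0.2], [0.2, 0.8]]
balanced = effective_sample_size(q, sources, [50, 50])   # mixture equals the target
one_sided = effective_sample_size(q, sources, [100, 0])  # all samples from source 0
estimate = post_stratified_mean(q, [[1.0, 1.0, 1.0], [3.0]])
```

In this toy setting, drawing all 100 samples from the first source yields a smaller effective sample size than splitting evenly, because the resulting mixture under-represents the second group relative to the target. The post-stratified estimate reweights group means by $q$, so it does not inherit the group imbalance of the collected sample the way a plain sample mean would.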
