Robust Estimation Under Heterogeneous Corruption Rates
Syomantak Chaudhuri, Jerry Li, Thomas A. Courtade
TL;DR
This work introduces a natural generalization of robust estimation to heterogeneous corruption rates, formalized via $\bm{\lambda}$-contamination where each sample is independently corrupted with probability $\lambda_i$. The authors establish tight minimax and PAC minimax rates for mean estimation under bounded and Gaussian distributions and for Gaussian linear regression, revealing that optimal estimators effectively discard high-corruption samples beyond a data-dependent threshold and, in some regimes, can improve through per-sample reweighting. Central methods include thresholding-based robust estimators and a family of per-sample weighted Tukey-type procedures that leverage the heterogeneity of corruption, with near-linear-time algorithms for implementing the optimal thresholding rule. Lower bounds are developed using Le Cam and Assouad techniques adapted to heterogeneous corruption, and the results show the minimax rate is governed by an effective rate function $f(\bm{\lambda},k)$ that depends on the distribution of $\lambda_i$. The findings have practical implications for distributed, federated, crowdsourced, and sensor-network settings, where corruption rates naturally vary across data sources, informing when to discard data and how to weight samples to achieve near-optimal estimation under heterogeneity.
Abstract
We study the problem of robust estimation under heterogeneous corruption rates, where each sample may be independently corrupted with a known but non-identical probability. This setting arises naturally in distributed and federated learning, crowdsourcing, and sensor networks, yet existing robust estimators typically assume uniform or worst-case corruption, ignoring structural heterogeneity. For mean estimation of multivariate bounded distributions and univariate Gaussian distributions, we give tight minimax rates for all heterogeneous corruption patterns. For multivariate Gaussian mean estimation and linear regression, we establish the minimax rate for squared error up to a factor of $\sqrt{d}$, where $d$ is the dimension. Roughly, our findings suggest that optimal estimators may discard samples beyond a certain corruption threshold, and that this threshold is determined by the empirical distribution of the given corruption rates.
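
To make the discarding intuition concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the estimator analyzed in the paper. It assumes a univariate Gaussian with unit variance, a corruption model in which every corrupted sample is replaced by one fixed outlier value, and a hand-picked cutoff `k`. The paper's optimal threshold rule and its rate function $f(\bm{\lambda},k)$ are not reproduced here. The names `sample_contaminated` and `thresholded_median` are hypothetical.

```python
import numpy as np


def sample_contaminated(rng, lambdas, mu=0.0, outlier=10.0):
    """Draw one sample per source; source i is corrupted with probability lambdas[i].

    Assumption for illustration: clean samples are N(mu, 1) and every
    corrupted sample is replaced by the same fixed value `outlier`.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    clean = rng.normal(loc=mu, scale=1.0, size=lambdas.shape)
    corrupted = rng.random(lambdas.shape) < lambdas
    return np.where(corrupted, outlier, clean)


def thresholded_median(x, lambdas, k):
    """Median of the k samples whose known corruption rates are smallest.

    The cutoff k is supplied by the caller; choosing it optimally is the
    subject of the paper and is not attempted here.
    """
    order = np.argsort(lambdas, kind="stable")
    return float(np.median(np.asarray(x)[order[:k]]))


# Half of the sources are fairly reliable, half are heavily corrupted.
rng = np.random.default_rng(0)
lambdas = np.concatenate([np.full(1000, 0.05), np.full(1000, 0.45)])
x = sample_contaminated(rng, lambdas)

full_est = thresholded_median(x, lambdas, k=len(x))      # keep every sample
kept_est = thresholded_median(x, lambdas, k=1000)        # keep low-rate half
print(f"median of all samples:      {full_est:.3f}")
print(f"median of low-rate samples: {kept_est:.3f}")
```

In this toy setup the true mean is 0, so the printed values can be read directly as errors. Dropping the high-rate half trades a smaller sample for a lower effective corruption level, which is the trade-off the threshold in the abstract is meant to balance.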
