Robust Estimation Under Heterogeneous Corruption Rates

Syomantak Chaudhuri, Jerry Li, Thomas A. Courtade

TL;DR

This work introduces a natural generalization of robust estimation to heterogeneous corruption rates, formalized via $\bm{\lambda}$-contamination where each sample is independently corrupted with probability $\lambda_i$. The authors establish tight minimax and PAC minimax rates for mean estimation under bounded and Gaussian distributions and for Gaussian linear regression, revealing that optimal estimators effectively discard high-corruption samples beyond a data-dependent threshold and, in some regimes, can improve through per-sample reweighting. Central methods include thresholding-based robust estimators and a family of per-sample weighted Tukey-type procedures that leverage the heterogeneity of corruption, with near-linear-time algorithms for implementing the optimal thresholding rule. Lower bounds are developed using Le Cam and Assouad techniques adapted to heterogeneous corruption, and the results show the minimax rate is governed by an effective rate function $f(\bm{\lambda},k)$ that depends on the distribution of $\lambda_i$. The findings have practical implications for distributed, federated, crowdsourced, and sensor-network settings, where corruption rates naturally vary across data sources, informing when to discard data and how to weight samples to achieve near-optimal estimation under heterogeneity.

Abstract

We study the problem of robust estimation under heterogeneous corruption rates, where each sample may be independently corrupted with a known but non-identical probability. This setting arises naturally in distributed and federated learning, crowdsourcing, and sensor networks, yet existing robust estimators typically assume uniform or worst-case corruption, ignoring structural heterogeneity. For mean estimation over multivariate bounded distributions and univariate Gaussian distributions, we give tight minimax rates for all heterogeneous corruption patterns. For multivariate Gaussian mean estimation and linear regression, we establish the minimax rate for squared error up to a factor of $\sqrt{d}$, where $d$ is the dimension. Roughly, our findings suggest that the optimal estimators may discard samples beyond a certain corruption threshold, which is determined by the empirical distribution of the given corruption rates.
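The contamination model and the idea of discarding high-corruption samples can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the paper's estimator: the function names `contaminate` and `thresholded_median`, the fixed outlier value, and the threshold `t` passed by hand are all hypothetical choices made for illustration. The paper derives its threshold from the empirical distribution of the corruption rates, and that rule is not reproduced here.

```python
import numpy as np


def contaminate(clean, lam, outlier, rng):
    """Replace clean[i] by `outlier` independently with probability lam[i].

    A fixed outlier value is an illustrative assumption; in the
    lambda-contamination model the corrupted values may be arbitrary.
    """
    corrupted = rng.random(lam.shape[0]) < lam
    return np.where(corrupted, outlier, clean), corrupted


def thresholded_median(x, lam, t):
    """Median of the samples whose known corruption rate is at most t.

    The threshold t is supplied by the caller (hypothetical); it is not
    the data-dependent threshold derived in the paper.
    """
    keep = lam <= t
    if not keep.any():
        raise ValueError("no sample has corruption rate <= t")
    return float(np.median(x[keep]))


rng = np.random.default_rng(0)
n = 20_000
# Half of the sources are nearly clean, half are heavily corrupted.
lam = np.concatenate([np.full(n // 2, 0.01), np.full(n // 2, 0.45)])
clean = rng.normal(loc=0.0, scale=1.0, size=n)  # true mean is 0
x, corrupted = contaminate(clean, lam, outlier=10.0, rng=rng)

print("median of all samples:      ", float(np.median(x)))
print("median of low-rate samples: ", thresholded_median(x, lam, t=0.1))
```

In this toy setting the heavily corrupted half pulls the plain median away from the true mean, while restricting to the low-rate samples removes most of that bias at the cost of using fewer points.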

Paper Structure

This paper contains 46 sections, 13 theorems, 166 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{D}_r^b$ be the set of all distributions on $\mathbb{R}^d$ supported on the $\ell_2$ ball of radius $r$. Then, Moreover, the optimal estimator can be implemented in nearly-linear time.

Figures (2)

  • Figure 1: Plot of weighted Tukey depth (see \ref{eq:tukey-depth}) visualized for three different weighting schemes. (A) is computed with the standard uniform weights $w_i = \frac{1}{n}$, (B) is computed with $w_i = \frac{\mathbb{I}\{\lambda_i \leq t \}}{|\{j:\lambda_j \leq t \}|}$ using the value of $t$ from \ref{eq:t-exp}, and (C) is computed with weights given by \ref{alg:bounded}. For the dataset, the true underlying distribution is $\mathcal{N}((0,0),I)$, and $\lambda$ is sampled i.i.d. Points are contaminated by replacing them with samples from $\mathcal{N}((2,2),I/5)$. Contaminated samples are marked with a red 'x'; the marker size of each point is proportional to $1-\lambda_i$. The estimated mean, the point with maximum depth, is marked with a yellow star.
  • Figure 2: Mean estimation algorithms for (a) bounded distributions and (b) univariate Gaussian distributions. The x-axis is a proxy for degree of contamination of the model.
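The weighting schemes (A) and (B) in Figure 1 can be sketched in a few lines. The sketch below makes two assumptions that are not taken from the source: it takes the weighted Tukey depth of a point $\eta$ to be $\min_{v} \sum_i w_i \, \mathbb{I}\{v^\top (x_i - \eta) \geq 0\}$ (the paper's own definition in its depth equation is not shown on this page), and it approximates the minimum over directions $v$ with a finite grid of angles in two dimensions. The function names and the threshold `t` are hypothetical, and the weights of scheme (C) are not reproduced.

```python
import numpy as np


def threshold_weights(lam, t):
    """Scheme (B): uniform weight on samples with lam[i] <= t, zero elsewhere."""
    keep = lam <= t
    if not keep.any():
        raise ValueError("no sample has corruption rate <= t")
    return keep / keep.sum()


def weighted_tukey_depth(point, x, w, n_dirs=360):
    """Approximate weighted Tukey depth of `point` for 2-D data `x`.

    Assumed definition: the smallest weighted mass of any closed
    half-plane whose boundary passes through `point`, approximated
    over `n_dirs` evenly spaced directions.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n_dirs, 2)
    proj = (x - point) @ dirs.T                                # (n, n_dirs)
    mass = w @ (proj >= 0)                                     # (n_dirs,)
    return float(mass.min())


def deepest_sample(x, w):
    """Return the sample point with the largest approximate weighted depth."""
    depths = np.array([weighted_tukey_depth(p, x, w) for p in x])
    return x[int(depths.argmax())]


rng = np.random.default_rng(1)
n = 300
lam = rng.uniform(0.0, 0.6, size=n)          # illustrative corruption rates
x = rng.normal(size=(n, 2))                  # clean samples from N((0,0), I)
bad = rng.random(n) < lam
x[bad] = rng.normal(loc=2.0, scale=np.sqrt(0.2), size=(int(bad.sum()), 2))

w_uniform = np.full(n, 1.0 / n)              # scheme (A)
w_thresh = threshold_weights(lam, t=0.2)     # scheme (B), hypothetical t

print("deepest point, uniform weights:    ", deepest_sample(x, w_uniform))
print("deepest point, thresholded weights:", deepest_sample(x, w_thresh))
```

Searching only over the sample points for the deepest location is a simplification chosen to keep the example short; a finer candidate grid would give a closer approximation to the point of maximum depth.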

Theorems & Definitions (19)

  • Definition 1: $\bm{\lambda}$-contamination
  • Definition 2: Minimax and minimax PAC rates for heterogeneous robust mean estimation [ma2024high, chaudhuri2025privatee]
  • Theorem 1
  • Theorem 2: informal, see \ref{thm:gauss-minimax}
  • Theorem 3: informal, see \ref{thm:LR}
  • Proposition 1: Upper Bound for Gaussian Distributions
  • Corollary 1: Minimax Rate for Univariate Gaussian Distributions
  • Theorem 4
  • Proposition 2: Upper Bound for Regression
  • Theorem 5: Minimax Rate for Linear Regression
  • ...and 9 more