High-Dimensional Robust Mean Estimation with Untrusted Batches

Maryam Aliakbarpour, Vladimir Braverman, Yuhan Liu, Junze Yin

TL;DR

Two Sum-of-Squares based algorithms achieve the minimax-optimal error rate, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.

Abstract

We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.
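
To make the mean-shift variant of the data model concrete, the following is a minimal simulation sketch, not the paper's SoS algorithm. It assumes unit-variance Gaussian samples, places adversarial batches in a single far-away cluster (just one possible adversary), and compares the naive average of batch means against a coordinate-wise median of batch means used here only as a simple baseline. All names (`simulate_mean_shift_batches`, `bad_offset`, and so on) are hypothetical.

```python
import math
import random
import statistics


def simulate_mean_shift_batches(N, n, d, eps, alpha, mu, rng, bad_offset=10.0):
    """Generate N batches of n samples in R^d (illustrative assumptions only).

    The first int(eps * N) users are adversarial and all report a far-away
    cluster; each good user samples around mu plus a shift of norm sqrt(alpha).
    """
    n_bad = int(eps * N)
    batches, is_bad = [], []
    for i in range(N):
        bad = i < n_bad
        if bad:
            center = [m + bad_offset for m in mu]
        else:
            # Random direction, rescaled so the mean shift has norm sqrt(alpha).
            g = [rng.gauss(0.0, 1.0) for _ in range(d)]
            norm = math.sqrt(sum(x * x for x in g))
            center = [m + math.sqrt(alpha) * x / norm for m, x in zip(mu, g)]
        batches.append([[rng.gauss(c, 1.0) for c in center] for _ in range(n)])
        is_bad.append(bad)
    return batches, is_bad


def batch_mean(batch):
    """Coordinate-wise average of one batch (a list of d-dimensional samples)."""
    return [sum(col) / len(batch) for col in zip(*batch)]


def l2_error(estimate, mu):
    """Euclidean distance between an estimate and the target mean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(estimate, mu)))


rng = random.Random(0)
N, n, d, eps, alpha = 200, 50, 5, 0.05, 0.01
mu = [0.0] * d

batches, is_bad = simulate_mean_shift_batches(N, n, d, eps, alpha, mu, rng)
means = [batch_mean(b) for b in batches]

# Naive estimator: average every batch mean, adversarial ones included.
naive = [sum(col) / N for col in zip(*means)]
# Simple robust baseline (NOT the paper's method): coordinate-wise median.
median = [statistics.median(col) for col in zip(*means)]

naive_err = l2_error(naive, mu)
median_err = l2_error(median, mu)
print(f"naive error:  {naive_err:.3f}")
print(f"median error: {median_err:.3f}")
```

In this toy setting the naive average is pulled toward the adversarial cluster, while the median of batch means stays close to $\mu$. The baseline is only meant to illustrate the setup and carries none of the guarantees stated in the abstract.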

Paper Structure

This paper contains 62 sections, 19 theorems, 236 equations, 4 figures, 2 algorithms.

Key Result

theorem 1

Let $\varepsilon < 0.1$, $\alpha < 0.1$. Under the setup of Problem \ref{prob:bounded_mean}, there exists a polynomial-time algorithm (Algorithm \ref{alg:sos_bounded}) that outputs $\widehat{\mu} \in \mathbb{R}^d$ such that $\|\mu - \widehat{\mu}\|_2 = O\left(\sqrt{\frac{\varepsilon}{n}} + \sqrt{\alpha}\right)$ with probability at least $1 - \delta$
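
As a rough numerical illustration of how the terms in the stated rate compare, the sketch below evaluates each term of the abstract's bound $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$ with all hidden constants set to 1, which is an assumption made purely for illustration. The function name `error_rate_terms` and the parameter values are hypothetical.

```python
import math


def error_rate_terms(eps, n, N, d, alpha):
    """Evaluate each term of the rate with constants set to 1 (an assumption).

    eps:   fraction of adversarial users
    n:     batch size
    N:     number of users
    d:     dimension
    alpha: heterogeneity (proximity) parameter
    """
    return {
        "adversarial_users": math.sqrt(eps / n),
        "sampling": math.sqrt(d / (n * N)),
        "heterogeneity": math.sqrt(alpha),
    }


# Hypothetical parameter values, chosen only to show the relative scales.
terms = error_rate_terms(eps=0.05, n=100, N=1000, d=50, alpha=0.01)
total = sum(terms.values())
for name, value in terms.items():
    print(f"{name:18s} {value:.4f}")
print(f"{'total':18s} {total:.4f}")
```

With these values the heterogeneity term $\sqrt{\alpha}$ dominates, while the adversarial-user term shrinks as the batch size $n$ grows. This matches the qualitative message that batch averaging suppresses user-level corruption but not heterogeneity.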

Figures (4)

  • Figure 1: Mean Shift Model. Illustration of Problem \ref{prob:bounded_mean}: each good user provides samples drawn from a distribution whose mean lies within a $\sqrt{\alpha}$-neighborhood of the target mean $\mu$, while an $\varepsilon$-fraction of users (yellow) are fully adversarial and may provide arbitrary samples.
  • Figure 2: Adversarial Model. Illustration of Problem \ref{prob:arbitrary_alpha}: beyond an $\varepsilon$-fraction of entirely adversarial users (yellow clusters), each good user's batch contains an $\alpha$-fraction of adversarially corrupted samples (yellow points within blue clusters), resulting in a two-level corruption model.
  • Figure 3: Lower Bound Construction for $\Omega(\sqrt{\varepsilon/n})$. Distributions $H_0$ and $H_1$ differ in their means by $\sqrt{\varepsilon/n}$ while maintaining bounded variance.
  • Figure 4: Lower Bound Construction for $\Omega(\sqrt{\alpha})$. Distributions $H_2$ and $H_3$ differ in their means by $\sqrt{\alpha}$ while maintaining bounded variance.
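
The two-level corruption shown in Figure 2 can be illustrated with a small one-dimensional sketch. It assumes standard Gaussian clean samples and an adversary that replaces an $\alpha$-fraction of one good batch with a single large value; the helper `batch_means_with_sample_corruption` is hypothetical. The point is only that the empirical batch mean moves by roughly $\alpha$ times the outlier magnitude, so it can be pushed arbitrarily far.

```python
import random


def batch_means_with_sample_corruption(n, alpha, outlier_value, rng):
    """Return (clean_mean, corrupted_mean) for one 1-D good batch.

    An int(alpha * n) subset of the samples is overwritten with
    outlier_value, a deliberately simple stand-in for an adversary.
    """
    k = int(alpha * n)
    clean = [rng.gauss(0.0, 1.0) for _ in range(n)]
    corrupted = [outlier_value] * k + clean[k:]
    return sum(clean) / n, sum(corrupted) / n


rng = random.Random(1)
n, alpha = 100, 0.05

shifts = {}
for outlier_value in (1e1, 1e3, 1e5):
    clean_mean, corrupted_mean = batch_means_with_sample_corruption(
        n, alpha, outlier_value, rng
    )
    # The shift grows linearly in the outlier magnitude for a fixed alpha.
    shifts[outlier_value] = corrupted_mean - clean_mean
    print(f"outlier={outlier_value:>9.0f}  mean shift={shifts[outlier_value]:.2f}")
```

Because the shift scales with the outlier magnitude, a plain batch average gives no protection in this model, which is why sample-level corruption needs separate treatment from user-level corruption.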

Theorems & Definitions (54)

  • definition 1: Strong contamination model \cite{diakonikolas2023algorithmic}
  • theorem 1: Mean shift
  • theorem 2: Adversarial
  • remark 1
  • definition 2: Sum-of-squares (SoS) proof
  • definition 3
  • definition 4: Polynomial constraints and satisfiability
  • proof
  • proof
  • proof
  • ...and 44 more