High-Dimensional Robust Mean Estimation with Untrusted Batches

Maryam Aliakbarpour; Vladimir Braverman; Yuhan Liu; Junze Yin

TL;DR

Two Sum-of-Squares based algorithms achieve the minimax-optimal error rate, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.

Abstract

We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.
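To get a feel for how the three terms of the stated rate trade off, the following Python sketch evaluates each term with all constants dropped. The function name `error_rate_terms` and the example parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def error_rate_terms(eps, alpha, n, N, d):
    """Return the three terms of sqrt(eps/n) + sqrt(d/(n*N)) + sqrt(alpha).

    Constants hidden by the O(.) notation are ignored, so the values only
    indicate relative scaling, not actual error guarantees.
    """
    adversarial = math.sqrt(eps / n)      # contribution of adversarial users
    statistical = math.sqrt(d / (n * N))  # sampling error over n*N samples
    heterogeneity = math.sqrt(alpha)      # contribution of good-user deviation
    return adversarial, statistical, heterogeneity


# Hypothetical parameter values, chosen only for illustration.
terms = error_rate_terms(eps=0.05, alpha=0.01, n=100, N=1000, d=50)
terms_larger_batches = error_rate_terms(eps=0.05, alpha=0.01, n=400, N=1000, d=50)
```

Under these assumed values, quadrupling the batch size $n$ halves the first two terms and leaves the $\sqrt{\alpha}$ term unchanged, which matches the abstract's point that heterogeneity is not reduced by batch averaging.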

Paper Structure (62 sections, 19 theorems, 236 equations, 4 figures, 2 algorithms)

Introduction
Problem Setup
Mean Shift Model.
Adversarial Model.
Corruption Model.
Related Work
Robust statistics
Robust estimation from corrupted batches
Other related robust learning models
Organization
Main Results
Unknown corruption
Preliminaries
Notation
Sum-of-squares proof
...and 47 more sections

Key Result

theorem 1

Let $\varepsilon < 0.1$ and $\alpha < 0.1$. Under the setup of Problem prob:bounded_mean, there exists a polynomial-time algorithm (Algorithm alg:sos_bounded) that outputs $\widehat{\mu} \in \mathbb{R}^d$ such that $\|\mu - \widehat{\mu}\|_2 = O\left(\sqrt{\frac{\varepsilon}{n}} + \sqrt{\alpha}\right)$ with probability at least $1 - \delta$ ...
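One way to see where a $\sqrt{\varepsilon/n}$ term can come from is to look at batch means. The derivation below is a heuristic sketch under an assumed identity-bounded covariance for each good user's distribution; the symbols $X_{i,j}$, $\bar{X}_i$, $\mu_i$, $\Sigma_i$ and $\sigma$ are introduced here for illustration, and this is not the paper's proof.

```latex
% Assumption: user i's n samples X_{i,1}, ..., X_{i,n} are i.i.d.
% with mean \mu_i and covariance \Sigma_i \preceq I.
\bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_{i,j},
\qquad
\mathbb{E}[\bar{X}_i] = \mu_i,
\qquad
\operatorname{Cov}(\bar{X}_i) = \frac{1}{n} \Sigma_i \preceq \frac{1}{n} I .

% Standard fact: with an \varepsilon-fraction of corrupted points and
% covariance bounded by \sigma^2 I, the mean can be robustly estimated
% to error O(\sigma \sqrt{\varepsilon}). Taking \sigma^2 = 1/n:
O\left(\sigma \sqrt{\varepsilon}\right) = O\left(\sqrt{\frac{\varepsilon}{n}}\right).
```

Treating each batch mean as a single point thus shrinks the effective noise scale by $1/\sqrt{n}$, while a mean shift of size $\sqrt{\alpha}$ in $\mu_i$ is untouched by the averaging.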

Figures (4)

Figure 1: Mean Shift Model. Illustration of Problem prob:bounded_mean: each good user provides samples drawn from a distribution whose mean lies within a $\sqrt{\alpha}$-neighborhood of the target mean $\mu$, while an $\varepsilon$-fraction of users (yellow) are fully adversarial and may provide arbitrary samples.
Figure 2: Adversarial Model. Illustration of Problem prob:arbitrary_alpha: beyond an $\varepsilon$-fraction of entirely adversarial users (yellow clusters), each good user’s batch contains an $\alpha$-fraction of adversarially corrupted samples (yellow points within blue clusters), resulting in a two-level corruption model.
Figure 3: Lower Bound Construction for $\Omega(\sqrt{\varepsilon/n})$. Distributions $H_0$ and $H_1$ differ in their means by $\sqrt{\varepsilon/n}$ while maintaining bounded variance.
Figure 4: Lower Bound Construction for $\Omega(\sqrt{\alpha})$. Distributions $H_2$ and $H_3$ differ in their means by $\sqrt{\alpha}$ while maintaining bounded variance.
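The mean shift setup illustrated in Figure 1 can be simulated in a few lines. The sketch below assumes unit-variance Gaussian samples for good users and a simple adversary that reports batches far from $\mu$; both choices, and the names `simulate_batch_means` and `coordinatewise_median`, are hypothetical. The coordinate-wise median is used only as a simple baseline and is not the paper's SoS algorithm.

```python
import random
import statistics


def simulate_batch_means(N, n, d, eps, alpha, mu, seed=0):
    """Return one batch mean per user under an assumed mean shift model."""
    rng = random.Random(seed)
    num_bad = int(eps * N)
    shift_radius = alpha ** 0.5
    batch_means = []
    for user in range(N):
        if user < num_bad:
            # Hypothetical adversary: every coordinate is pushed 10 away from mu.
            batch_means.append([m + 10.0 for m in mu])
            continue
        # Good user: mean shifted by a random vector of norm at most sqrt(alpha).
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = sum(x * x for x in direction) ** 0.5
        scale = shift_radius * rng.random() / norm
        user_mean = [m + scale * x for m, x in zip(mu, direction)]
        # Average n unit-variance Gaussian samples in each coordinate.
        batch_means.append([
            statistics.fmean([rng.gauss(um, 1.0) for _ in range(n)])
            for um in user_mean
        ])
    return batch_means


def coordinatewise_median(vectors):
    """Baseline estimator: median of each coordinate across users."""
    return [statistics.median(column) for column in zip(*vectors)]


def l2_error(estimate, target):
    return sum((a - b) ** 2 for a, b in zip(estimate, target)) ** 0.5


mu = [0.0] * 5
means = simulate_batch_means(N=200, n=50, d=5, eps=0.05, alpha=0.01, mu=mu)
naive = [statistics.fmean(column) for column in zip(*means)]
median_error = l2_error(coordinatewise_median(means), mu)
naive_error = l2_error(naive, mu)
```

In this toy run the plain average of batch means is dragged toward the adversarial batches, while the median baseline stays close to $\mu$. The baseline's error generally degrades with the dimension $d$, which is one reason high-dimensional settings call for more careful estimators.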

Theorems & Definitions (54)

definition 1: Strong contamination model [diakonikolas2023algorithmic]
theorem 1: Mean shift
theorem 2: Adversarial
remark 1
definition 2: Sum-of-squares (SoS) proof
definition 3
definition 4: Polynomial constraints and satisfiability
proof
proof
proof
...and 44 more
