Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Sihan Liu

TL;DR

It is shown that, under mild spectral conditions on the characteristic function of the (potentially multivariate) base distribution, there exists a sample-efficient algorithm that estimates the target mean to any desired accuracy.

Abstract

We study the basic task of mean estimation in the presence of mean-shift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from arbitrarily shifted versions of the base distribution. Prior work characterized the sample complexity of this task for the special cases of the Gaussian and Laplace distributions. Specifically, it was shown that consistent estimation is possible in these cases, a property that is provably impossible in Huber's contamination model. An open question posed in earlier work was to determine the sample complexity of mean estimation in the mean-shift contamination model for general base distributions. In this work, we study and essentially resolve this open question. Specifically, we show that, under mild spectral conditions on the characteristic function of the (potentially multivariate) base distribution, there exists a sample-efficient algorithm that estimates the target mean to any desired accuracy. We complement our upper bound with a qualitatively matching sample complexity lower bound. Our techniques make critical use of Fourier analysis, and in particular introduce the notion of a Fourier witness as an essential ingredient of our upper and lower bounds.
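The contamination model described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's algorithm: it assumes a one-dimensional standard Gaussian base distribution, and an adversary that replaces the first `floor(alpha * n)` samples with draws shifted to a single fixed location. The names `contaminated_samples`, `draw_noise`, and `pick_shift` are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List


def contaminated_samples(
    n: int,
    mu: float,
    alpha: float,
    draw_noise: Callable[[], float],
    pick_shift: Callable[[int], float],
) -> List[float]:
    """Return n one-dimensional samples, floor(alpha * n) of which are outliers.

    Inliers are mu + noise. Outlier i is pick_shift(i) + noise, i.e. a draw
    from the base distribution shifted to a location chosen by the adversary.
    """
    num_outliers = int(alpha * n)
    samples = []
    for i in range(n):
        center = pick_shift(i) if i < num_outliers else mu
        samples.append(center + draw_noise())
    return samples


rng = random.Random(0)
samples = contaminated_samples(
    n=20_000,
    mu=0.0,
    alpha=0.2,
    draw_noise=lambda: rng.gauss(0.0, 1.0),  # assumed base distribution: N(0, 1)
    pick_shift=lambda i: 10.0,               # assumed adversary: one fixed shift
)
naive_mean = sum(samples) / len(samples)
print(f"true mean 0.0, naive sample mean {naive_mean:.2f}")
```

In this setup the naive sample mean lands near `alpha * 10 = 2` rather than near the true mean of 0, which illustrates why a dedicated robust estimator is needed even though every outlier is itself a draw from a shifted copy of the base distribution.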

Paper Structure

This paper contains 15 sections, 15 theorems, 103 equations, 1 table, and 1 algorithm.

Key Result

Theorem 1.2

Let $D$ be a distribution over $\mathbb{R}^d$ with characteristic function $\phi_D$, $\alpha \in (0,1/2)$ be the contamination parameter, and $\epsilon$ be the target error. Then, under mild technical assumptions on $D$, there exists an algorithm that estimates the mean of $D_\mu$ from $\widetilde{O

Theorems & Definitions (55)

  • Definition 1.1: Mean-Shift Contamination Model
  • Theorem 1.2: Informal Main Result
  • Remark 1.3: Consistency
  • Definition 1.4: Characteristic function
  • Definition 3.1: Frequency-witness condition
  • Theorem 3.2: Upper bound via frequency witnesses
  • Proof of Theorem 3.2
  • Claim 3.3: Every frequency witness has bounded norm
  • Claim 3.4: Distant candidates have a large $T_{\widehat{\mu}}$
  • Claim 3.5: Close candidates have small $T_{\widehat{\mu}}$
  • ...and 45 more
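
As background for the entries above that refer to the characteristic function, the standard definition and its behavior under a shift can be written as follows. This is a sketch under assumptions: the sign and normalization conventions may differ from the paper's Definition 1.4, and $D_\mu$ is taken here to mean the distribution of $X + \mu$ for $X \sim D$.

$$\phi_D(\omega) = \mathbb{E}_{X \sim D}\left[e^{i \langle \omega, X \rangle}\right], \qquad \phi_{D_\mu}(\omega) = e^{i \langle \omega, \mu \rangle}\, \phi_D(\omega), \qquad \omega \in \mathbb{R}^d.$$

Under this convention, shifting the distribution only multiplies $\phi_D$ by a unit-modulus phase, so the shift $\mu$ leaves no trace at any frequency $\omega$ where $\phi_D(\omega) = 0$.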