Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

Cullen Anderson, Jeff M. Phillips

TL;DR

This work addresses robust mean estimation in high dimensions under low data size, a regime where classical theory often demands $n\asymp d$ or larger. It conducts an extensive empirical comparison across many estimators and introduces practical adaptations (notably QUE_low_n with an eigenvalue-threshold refinement) to handle $d\ge n$ scenarios. The study shows that, for Gaussian-like inliers, QUE_low_n nearly matches the best possible inlier mean and often surpasses other robust methods, while real-world embeddings demonstrate reliable performance with early halting; subtractive corruption remains particularly challenging. Overall, the paper highlights the practical value of robust mean estimation under limited data, provides actionable algorithmic adjustments, and motivates further theoretical and empirical exploration beyond Gaussian assumptions.

Abstract

Robust statistics aims to compute quantities that represent data of which a fraction may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years there has been a flurry of theoretical advances in efficiently estimating the mean in high dimensions on corrupted data. While several proposed algorithms achieve near-optimal error, they all require the data size to be large as a function of dimension. In this paper, we perform extensive experiments on various mean estimation techniques in settings where the data size might not meet this requirement because of the high dimension.

Paper Structure

This paper contains 71 sections, 5 theorems, 11 equations, 73 figures, 1 table.

Key Result

Theorem 1

Let $X$ be an $n \times d$ matrix whose rows are independently drawn from $\mathcal{N}(\mu, I)$. Let $\Sigma = \frac{1}{n}(X-\bar{\mu})^T(X-\bar{\mu})$ be the sample covariance matrix of $X$, where $\bar{\mu} = \frac{1}{n} \sum_i X_i$ and $X_i$ is the $i$th row of $X$. Then for every $t > 0$, with
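As a minimal sketch of the objects defined in Theorem 1, the following NumPy snippet draws $X$ with rows from $\mathcal{N}(\mu, I)$ in a low-data regime ($d \ge n$) and forms the sample covariance $\Sigma$ exactly as written above. The values of `n`, `d`, `mu`, and the random seed are illustrative assumptions, not taken from the paper, and the function name `sample_covariance` is hypothetical. The sketch does not reproduce the theorem's bound; it only shows how $\Sigma$ is computed and that it is rank-deficient when $d \ge n$.

```python
import numpy as np

def sample_covariance(X):
    """Sigma = (1/n) (X - mu_bar)^T (X - mu_bar), where mu_bar is the mean of the rows of X."""
    n = X.shape[0]
    mu_bar = X.mean(axis=0)      # sample mean, shape (d,)
    centered = X - mu_bar        # subtract mu_bar from every row
    return centered.T @ centered / n

# Illustrative sizes (assumption): fewer samples than dimensions.
rng = np.random.default_rng(0)
n, d = 20, 50
mu = np.ones(d)                  # arbitrary true mean (assumption)

# Each row of X is one draw from N(mu, I).
X = rng.normal(loc=mu, scale=1.0, size=(n, d))

Sigma = sample_covariance(X)
eigvals = np.linalg.eigvalsh(Sigma)      # ascending eigenvalues of the symmetric matrix
rank = np.linalg.matrix_rank(Sigma)

print("shape of Sigma:", Sigma.shape)
print("numerical rank:", rank, "(at most n - 1 =", n - 1, ")")
print("largest eigenvalue:", eigvals[-1])
```

Because centering removes one degree of freedom, $\Sigma$ has rank at most $n - 1$, so when $d \ge n$ at least $d - n + 1$ of its eigenvalues are zero. This is one reason spectral quantities behave differently in the low-data regime the study targets.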

Figures (73)

  • Figure 1: Uncorrupted Gaussian Identity Covariance
  • Figure 2: Corrupted Gaussian Identity Covariance: Additive Variance Shell Noise
  • Figure 3: Corrupted Gaussian Identity Covariance: DKK Noise
  • Figure 4: Corrupted Gaussian Identity Covariance: Subtractive Noise
  • Figure 5: Corrupted Gaussian Large Spherical Covariance: Additive Variance Shell Noise
  • ...and 68 more figures

Theorems & Definitions (7)

  • Theorem 1
  • Corollary 1.1
  • Theorem 2: \cite{vershynin2011randommatrices}, Thm. 5.35
  • Theorem 3: restatement of Theorem \ref{thm:Sigma2-bound-main}
  • proof
  • Corollary 3.1: restatement of Corollary \ref{cor:prune-2t-main}
  • proof