Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study
Cullen Anderson, Jeff M. Phillips
TL;DR
This work addresses robust mean estimation in high dimensions with low data size, a regime where classical theory often demands a sample size $n\asymp d$ or larger for dimension $d$. It conducts an extensive empirical comparison of many estimators and introduces practical adaptations (notably QUE_low_n with an eigenvalue-threshold refinement) to handle the $d\ge n$ case. The study shows that, for Gaussian-like inliers, QUE_low_n nearly matches the best possible inlier mean and often surpasses other robust methods, while experiments on real-world embeddings show reliable performance with early halting; subtractive corruption remains particularly challenging. Overall, the paper highlights the practical value of robust mean estimation under limited data, provides actionable algorithmic adjustments, and motivates further theoretical and empirical work beyond Gaussian assumptions.
Abstract
Robust statistics aims to compute quantities that represent data in which a fraction of the points may be arbitrarily corrupted. The most essential statistic is the mean, and recent years have seen a flurry of theoretical advances in efficiently estimating the mean of corrupted high-dimensional data. While several proposed algorithms achieve near-optimal error, they all require a data size that is large as a function of the dimension. In this paper, we conduct extensive experiments on a variety of mean estimation techniques in settings where the data size may not meet this requirement because of the high dimension.
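To make the setting concrete, the following is a minimal sketch of the problem described above, not any of the estimators studied in the paper. It draws fewer samples than dimensions ($n < d$) from a standard Gaussian, replaces a fraction of them with shifted outliers, and compares the plain sample mean against a simple coordinate-wise median baseline. The corruption model, the function names, and the parameters `eps` and `shift` are assumptions chosen only for illustration.

```python
import math
import random
import statistics


def make_corrupted_sample(n, d, eps, shift, seed=0):
    """Return n points in R^d, a fraction eps of which are outliers.

    Inliers are drawn from N(0, I), so the true inlier mean is the origin.
    Outliers are inliers shifted by `shift` in every coordinate; this is a
    hypothetical corruption model used only for illustration.
    """
    rng = random.Random(seed)
    n_out = int(eps * n)
    inliers = [[rng.gauss(0.0, 1.0) for _ in range(d)]
               for _ in range(n - n_out)]
    outliers = [[shift + rng.gauss(0.0, 1.0) for _ in range(d)]
                for _ in range(n_out)]
    return inliers + outliers


def sample_mean(points):
    """Coordinate-wise average of the points (not robust to outliers)."""
    return [statistics.fmean(column) for column in zip(*points)]


def coordinate_median(points):
    """Coordinate-wise median, a simple robust baseline."""
    return [statistics.median(column) for column in zip(*points)]


def l2_error(estimate):
    """Euclidean distance from the estimate to the true inlier mean (origin)."""
    return math.sqrt(sum(x * x for x in estimate))


if __name__ == "__main__":
    # Low data size relative to dimension: n = 50 samples in d = 200 dimensions.
    data = make_corrupted_sample(n=50, d=200, eps=0.1, shift=10.0)
    print("sample mean error:      ", l2_error(sample_mean(data)))
    print("coordinate median error:", l2_error(coordinate_median(data)))
```

In this toy setup the sample mean is pulled toward the outliers in every coordinate, so its error grows with $\sqrt{d}$, whereas the coordinate-wise median is much less affected. The median is only a baseline here; it is not expected to match the more refined estimators the paper compares.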
