Simple and Optimal Sublinear Algorithms for Mean Estimation

Beatrice Bertolotti, Matteo Russo, Chris Schwiegelshohn, Sudarshan Shyam

TL;DR

This work addresses sublinear, high-dimensional mean estimation. It proves that a $(1+\varepsilon)$-approximate mean can be obtained with $O(\varepsilon^{-1}\log \delta^{-1})$ samples and develops three estimators with optimal sample complexity: coordinate-wise median-of-means, geometric-median-of-means with a fast gradient-descent solver, and an order-statistics-based estimator called MinSumSelect, each with near-optimal running time in $d$ and $\log \delta^{-1}$. The paper provides rigorous generalization bounds and matching lower bounds, plus an extensive empirical study showing that the geometric-median-of-means approach is often the most competitive in practice, with MinSumSelect and CoordWiseMedian as fast alternatives. Together, these results establish tight sample-complexity limits for sublinear mean estimation and offer practical, scalable methods for high-dimensional data analysis.

Abstract

We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$ approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ many independent uniform random samples, and provide a matching lower bound. Furthermore, we give two estimators with optimal sample complexity that can be computed in optimal running time for extracting a suitable approximate mean:

  1. The coordinate-wise median of $\log \delta^{-1}$ sample means of sample size $\varepsilon^{-1}$. As a corollary, we also show improved convergence rates for this estimator for estimating means of multivariate distributions.
  2. The geometric median of $\log \delta^{-1}$ sample means of sample size $\varepsilon^{-1}$. To compute a solution efficiently, we design a novel and simple gradient descent algorithm that is significantly faster for our specific setting than all other known algorithms for computing geometric medians.

In addition, we propose an order statistics approach that is empirically competitive with these algorithms, has an optimal sample complexity and matches the running time up to lower order terms. We finally provide an extensive experimental evaluation among several estimators which concludes that the geometric-median-of-means-based approach is typically the most competitive in practice.
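The two median-of-means estimators described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than the paper's method: the function names are hypothetical, the constants in the batch size $\varepsilon^{-1}$ and batch count $\log \delta^{-1}$ are simply set to 1, and the geometric median is computed with the classical Weiszfeld iteration as a stand-in, not with the paper's own gradient descent algorithm.

```python
import numpy as np

def batch_means(A, num_batches, batch_size, rng):
    """Means of num_batches independent uniform samples (with replacement) from A."""
    idx = rng.integers(0, A.shape[0], size=(num_batches, batch_size))
    return A[idx].mean(axis=1)  # shape: (num_batches, d)

def coordwise_median_of_means(means):
    """Coordinate-wise median of the batch means."""
    return np.median(means, axis=0)

def geometric_median_weiszfeld(points, iters=100, tol=1e-9):
    """Geometric median via Weiszfeld's iteration (a stand-in solver,
    NOT the gradient descent algorithm proposed in the paper)."""
    y = points.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(points - y, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        w = 1.0 / dist
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) <= tol:
            return y_new
        y = y_new
    return y

def cost(A, c):
    """Sum of squared Euclidean distances from the points of A to c."""
    return float(((A - c) ** 2).sum())

# Small demo on synthetic data (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 8))
eps, delta = 0.1, 0.01
batch_size = int(np.ceil(1.0 / eps))             # assumed constant factor 1
num_batches = int(np.ceil(np.log(1.0 / delta)))  # assumed constant factor 1
means = batch_means(A, num_batches, batch_size, rng)
cw = coordwise_median_of_means(means)
gm = geometric_median_weiszfeld(means)
opt = cost(A, A.mean(axis=0))
print("cost ratio, coordinate-wise median:", cost(A, cw) / opt)
print("cost ratio, geometric median:", cost(A, gm) / opt)
```

Both estimators touch only `num_batches * batch_size` points of `A`, which is what makes the approach sublinear in the size of the ground set. The printed ratios compare each estimate's cost with the optimal cost attained by the true mean; with the constants assumed here they are not guaranteed to fall below $1+\varepsilon$.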

Paper Structure

This paper contains 26 sections, 18 theorems, 33 equations, 7 figures, 1 table, 4 algorithms.

Key Result

Lemma 2.1

$\mathbb{E}\left[\|\mu(A)-\hat{\mu}(S)\|^2 \right] = \frac{1}{|S|}\cdot \frac{\textup{Opt}}{n}$.
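Lemma 2.1 can be checked numerically with a short simulation. The sketch below assumes the usual reading of the symbols, which this summary does not spell out: $S$ is a sample of independent uniform draws from $A$ (with replacement), $\hat{\mu}(S)$ is its mean, $n = |A|$, and $\textup{Opt}$ is the sum of squared distances from the points of $A$ to $\mu(A)$. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def lemma_2_1_check(A, sample_size, trials, rng):
    """Compare the empirical mean squared error of the sample mean
    with the value Opt / (n * |S|) predicted by Lemma 2.1."""
    n = A.shape[0]
    mu = A.mean(axis=0)
    opt = ((A - mu) ** 2).sum()  # sum of squared distances to the true mean
    # Each row of idx is one independent uniform sample with replacement.
    idx = rng.integers(0, n, size=(trials, sample_size))
    sample_means = A[idx].mean(axis=1)  # shape: (trials, d)
    empirical = ((sample_means - mu) ** 2).sum(axis=1).mean()
    predicted = opt / (n * sample_size)
    return float(empirical), float(predicted)

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 5))  # synthetic ground set, for illustration only
empirical, predicted = lemma_2_1_check(A, sample_size=20, trials=20000, rng=rng)
print("empirical:", empirical, "predicted:", predicted)
```

Combined with the standard identity $\sum_{a \in A}\|a - c\|^2 = \textup{Opt} + n\,\|c - \mu(A)\|^2$, the lemma implies that the expected cost of a sample mean is $(1 + 1/|S|)\cdot\textup{Opt}$, which suggests why samples of size on the order of $\varepsilon^{-1}$ already give a $(1+\varepsilon)$-approximation in expectation.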

Figures (7)

  • Figure 1: Good empirical means are shown in green and bad ones in red. The ball of good means centered at $\mu$ has radius $r$. The projections of all good means lie on a bounded line segment of length at most $2r$.
  • Figure 2: MNIST Dataset: Accuracy and runtime against sample size.
  • Figure 3: Fashion-MNIST Dataset: Accuracy and runtime against sample size.
  • Figure 4: CoverType Dataset: Accuracy and runtime against sample size.
  • Figure 5: MNIST, Fashion-MNIST and CoverType Datasets: Accuracy against sample size (in linear scale).
  • ...and 2 more figures

Theorems & Definitions (35)

  • Lemma 2.1: Lemma 1 of InabaKI94
  • Lemma 2.2: High Dimensional Mean-Variance Decomposition
  • proof
  • Lemma 2.3
  • proof
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Proposition 3.3
  • ...and 25 more