A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation

Ankit Pensia

TL;DR

The paper tackles robust sparse mean estimation under ε-contamination in high dimensions, aiming for subquadratic runtime while keeping the sample complexity at $\mathrm{poly}(k,\log d, 1/ε)$. The author develops a subquadratic-time algorithm that uses fast correlation detection (Valiant) to avoid forming the full sample covariance matrix, combined with sparse certificates and randomized filtering to iteratively remove outliers. The approach also extends to robust sparse PCA, delivering subquadratic-time guarantees with dimension-independent error up to polylogarithmic factors. The work situates itself among prior SDP-based and spectral methods, breaking the quadratic-time barrier for sparse robust estimation and pointing to linear-time robust estimation in high dimensions as a future direction. Overall, the results improve the computational efficiency of robust high-dimensional inference for sparse means and principal components, which matters for high-dimensional data analysis where outliers are prevalent.

Abstract

We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a corrupted set of samples from $\mathcal{N}(μ,\mathbf{I}_d)$, where the unknown mean $μ\in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/ε)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/ε)$, where $ε$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic ($Ω(d^2)$), which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in subquadratic time using $\mathrm{poly}(k,\log d,1/ε)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant.
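
To make the setting in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It draws samples from $\mathcal{N}(μ,\mathbf{I}_d)$ with a $k$-sparse $μ$, lets a deliberately simple, hypothetical adversary shift an $ε$-fraction of the samples, and compares two naive estimators. It also forms the $d \times d$ sample covariance matrix to show the object whose size causes the quadratic barrier. The function names, the adversary, the parameter values, and the coordinate-wise median baseline are all assumptions chosen for illustration.

```python
import numpy as np

def corrupted_sparse_gaussian(n, d, k, eps, rng):
    """Sample from N(mu, I_d) with k-sparse mu, then corrupt an eps-fraction.

    The adversary here is a toy one (an assumption for illustration): it adds
    a large shift on k coordinates outside the support of mu.
    """
    mu = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    mu[support] = 1.0
    X = rng.standard_normal((n, d)) + mu
    m = int(eps * n)                                   # number of corrupted rows
    off = np.setdiff1d(np.arange(d), support)[:k]      # k off-support coordinates
    X[:m, off] += 20.0
    return X, mu

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 200, 5, 0.1
X, mu = corrupted_sparse_gaussian(n, d, k, eps, rng)

# Non-robust baseline: empirical mean, truncated to k entries.
err_mean = np.linalg.norm(top_k(X.mean(axis=0), k) - mu)

# Simple robust baseline (NOT the paper's method): coordinate-wise median.
err_median = np.linalg.norm(top_k(np.median(X, axis=0), k) - mu)

# Covariance-based methods start from this d x d matrix, i.e. d^2 entries.
cov = np.cov(X, rowvar=False)

print(f"truncated mean error:   {err_mean:.3f}")
print(f"truncated median error: {err_median:.3f}")
print(f"sample covariance shape: {cov.shape}")
```

Against this toy adversary the truncated mean is pulled toward the shifted coordinates while the median is not. A stronger adversary can defeat coordinate-wise baselines, which is why the algorithms discussed here inspect second-moment information, and why avoiding the explicit $d \times d$ matrix is the central difficulty.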

Paper Structure (45 sections, 21 theorems, 31 equations, 8 algorithms)


Key Result

Theorem 1.5

Let the contamination rate be $\epsilon \in (0,\epsilon_0)$ for a small constant $\epsilon_0 \in (0,1/2)$ and $k \in \mathbb N$ be the sparsity. Let $T$ be an $\epsilon$-corrupted set of $n$ samples from $\mathcal{N}(\mu,\mathbf{I}_d)$ for an unknown $k$-sparse $\mu \in \mathbb R^d$. Then there is a

Theorems & Definitions (50)

  • Definition 1.1: Strong Contamination Model
  • Theorem 1.5: Robust Sparse Mean Estimation in Subquadratic Time
  • Theorem 1.6: Robust Sparse PCA in Subquadratic Time
  • Proposition 2.1: Sparse estimation using $\|\cdot\|_{2,k}$ norm CheDKGGS21
  • Definition 2.2: Projection of Pairs of Coordinates
  • Definition 2.3: Stability
  • Lemma 2.4: Stability Sample Complexity CheDKGGS21
  • Theorem 2.5: Guarantee of the randomized filtering algorithm; DiaKan22-book
  • Lemma 2.6: Sparse Certificate Lemma, see, e.g., BalDLS17
  • Lemma 2.6: Sparse Filtering Lemma
  • ...and 40 more