A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation

Ankit Pensia

TL;DR

The paper tackles robust sparse mean estimation under ε-contamination in high dimensions, aiming for subquadratic runtime while keeping the sample complexity at $\mathrm{poly}(k,\log d, 1/ε)$. The author develops a subquadratic-time algorithm that uses fast correlation detection (Valiant) to avoid forming the full sample covariance matrix, combined with sparse certificates and randomized filtering to iteratively remove outliers. The approach also extends to robust sparse PCA, with subquadratic-time guarantees and dimension-independent error up to polylogarithmic factors. The work builds on prior SDP-based and spectral methods, breaks the quadratic-time barrier for robust sparse estimation, and points to linear-time robust estimation in high dimensions as a future direction. Overall, the results improve the computational efficiency of robust high-dimensional inference for sparse means and principal components, which matters for high-dimensional data analysis where outliers are prevalent.

Abstract

We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(μ,\mathbf{I}_d)$, where the unknown mean $μ\in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/ε)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/ε)$, where $ε$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic ($Ω(d^2)$), which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/ε)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant.
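To make the setting in the abstract concrete, the following is a minimal sketch of an ε-corrupted sample set and two simple estimators. It is not the paper's algorithm: the adversary (replacing rows with a constant vector), the coordinate-wise median baseline, and all function names are assumptions chosen for illustration only. The sketch shows that the thresholded sample mean is pulled away by outliers, while even a crude robust baseline does better in this toy instance.

```python
import random
import statistics


def make_corrupted_samples(n, d, k, eps, seed=0):
    """Draw n samples from N(mu, I_d) with a k-sparse mu, then corrupt an eps fraction.

    The corruption used here (overwrite rows with the constant 10.0) is a
    hypothetical, very crude adversary, far weaker than the strong
    contamination model allows.
    """
    rng = random.Random(seed)
    mu = [0.0] * d
    for j in rng.sample(range(d), k):
        mu[j] = 1.0
    samples = [[rng.gauss(mu[j], 1.0) for j in range(d)] for _ in range(n)]
    for i in range(int(eps * n)):
        samples[i] = [10.0] * d
    return samples, mu


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in keep:
        out[j] = v[j]
    return out


def coordinate_stats(samples, stat):
    """Apply a one-dimensional statistic to every coordinate (column)."""
    return [stat(col) for col in zip(*samples)]


def l2_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


samples, mu = make_corrupted_samples(n=400, d=200, k=4, eps=0.1)
naive = top_k(coordinate_stats(samples, statistics.fmean), 4)
robust = top_k(coordinate_stats(samples, statistics.median), 4)
naive_err = l2_error(naive, mu)
median_err = l2_error(robust, mu)
print(f"thresholded mean error:   {naive_err:.3f}")
print(f"thresholded median error: {median_err:.3f}")
```

Both baselines touch each coordinate independently, so they never form the $d \times d$ sample covariance matrix. The algorithms discussed in the abstract inspect pairwise structure to obtain stronger guarantees, and that is where the $d^2$ cost arises.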

Paper Structure (45 sections, 21 theorems, 31 equations, 8 algorithms)

Introduction
Our Results
Overview of Techniques
(Dense) Robust Mean Estimation
Adapting to Sparsity and Smaller Sample Complexity
Spectral Algorithm of DiaKKPS19
Fast Correlation Detection Algorithm To The Rescue
Challenges in Applying Fast Correlation Detection and A Proposed Fix
Related Work
Robust Sparse Estimation
Fast Algorithms for Robust Estimation
Fast Correlation Detection
Preliminaries
Notation
Deterministic Condition on Inliers
...and 30 more sections

Key Result

Theorem 1.5

Let the contamination rate be $\epsilon \in (0,\epsilon_0)$ for a small constant $\epsilon_0 \in (0,1/2)$ and $k \in \mathbb N$ be the sparsity. Let $T$ be an $\epsilon$-corrupted set of $n$ samples from $\mathcal{N}(\mu,\mathbf{I}_d)$ for an unknown $k$-sparse $\mu \in \mathbb R^d$. Then there is a

Theorems & Definitions (50)

Definition 1.1: Strong Contamination Model
Theorem 1.5: Robust Sparse Mean Estimation in Subquadratic Time
Theorem 1.6: Robust Sparse PCA in Subquadratic Time
Proposition 2.1: Sparse estimation using $\|\cdot\|_{2,k}$ norm CheDKGGS21
Definition 2.2: Projection of Pairs of Coordinates
Definition 2.3: Stability
Lemma 2.4: Stability Sample Complexity CheDKGGS21
Theorem 2.5: Guarantee of the randomized filtering algorithm; DiaKan22-book
Lemma 2.6: Sparse Certificate Lemma, see, e.g., BalDLS17
Lemma 2.6: Sparse Filtering Lemma
...and 40 more
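
The listing above names the $\|\cdot\|_{2,k}$ norm (Proposition 2.1) without stating it. For reference, the definition commonly used in the sparse estimation literature is given below; it is assumed, not confirmed from this page, that the paper uses the same one.

```latex
\[
  \|x\|_{2,k}
  \;=\; \max_{\substack{v \in \mathbb{R}^d:\ \|v\|_2 = 1,\\ \|v\|_0 \le k}} \langle v, x \rangle
  \;=\; \Big( \max_{S \subseteq [d],\ |S| \le k} \sum_{i \in S} x_i^2 \Big)^{1/2}
\]
```

In words, $\|x\|_{2,k}$ is the Euclidean norm of the $k$ largest-magnitude entries of $x$, which is the natural error measure when the target $\mu$ is $k$-sparse.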
