Entangled Mean Estimation in High-Dimensions

Ilias Diakonikolas; Daniel M. Kane; Sihan Liu; Thanasis Pittas

TL;DR

This work resolves the challenging problem of high-dimensional entangled mean estimation under the subset-of-signals model by delivering a computationally efficient algorithm that nearly matches the information-theoretic limit. It combines a warm-start tournament, rejection sampling to filter noisy samples, and a recursive dimensionality-reduction scheme that identifies low-variance subspaces and progressively refines the mean estimate. The main result shows that the estimation error decomposes into a one-dimensional term $f(\alpha,N)$ and a sub-Gaussian term $\sqrt{D/(\alpha N)}$, up to polylogarithmic factors, and that the algorithm runs in polynomial time in $D$ and $N$ provided $N \ge \widetilde{\Omega}(D/\alpha)$. This advances our understanding of multivariate entangled mean estimation by matching the 1D lower bounds up to polylog factors and paves the way for robust, scalable estimation in heterogeneous Gaussian settings with bounded subset covariances.
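As a rough illustration of how the two terms of the error rate combine, the following sketch evaluates $\sqrt{D/(\alpha N)}$ and adds a one-dimensional term. It ignores the polylogarithmic factors, and the callable `f` is a placeholder supplied by the caller: the actual form of $f(\alpha,N)$ is not given in this summary, so nothing here should be read as its definition.

```python
import math


def sub_gaussian_term(D, alpha, N):
    """Return sqrt(D / (alpha * N)), the sub-Gaussian part of the error rate."""
    if not (0 < alpha < 1) or D <= 0 or N <= 0:
        raise ValueError("need D > 0, N > 0 and alpha in (0, 1)")
    return math.sqrt(D / (alpha * N))


def error_rate(D, alpha, N, f):
    """Return f(alpha, N) + sqrt(D / (alpha * N)), ignoring polylog factors.

    `f` is a hypothetical caller-supplied callable standing in for the
    one-dimensional error; its true form is not specified here.
    """
    return f(alpha, N) + sub_gaussian_term(D, alpha, N)
```

For instance, with $D = 100$, $\alpha = 0.1$ and $N = 10^5$, the sub-Gaussian term is $\sqrt{100/10^4} = 0.1$; the regime $N \gtrsim D/\alpha$ is the one in which this term is at most a constant.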

Abstract

We study the task of high-dimensional entangled mean estimation in the subset-of-signals model. Specifically, given $N$ independent random points $x_1,\ldots,x_N$ in $\mathbb{R}^D$ and a parameter $\alpha \in (0, 1)$ such that each $x_i$ is drawn from a Gaussian with mean $\mu$ and unknown covariance, and an unknown $\alpha$-fraction of the points have identity-bounded covariances, the goal is to estimate the common mean $\mu$. The one-dimensional version of this task has received significant attention in theoretical computer science and statistics over the past decades. Recent work [LY20; CV24] has given near-optimal upper and lower bounds for the one-dimensional setting. On the other hand, our understanding of even the information-theoretic aspects of the multivariate setting has remained limited. In this work, we design a computationally efficient algorithm achieving an information-theoretically near-optimal error. Specifically, we show that the optimal error (up to polylogarithmic factors) is $f(\alpha,N) + \sqrt{D/(\alpha N)}$, where the term $f(\alpha,N)$ is the error of the one-dimensional problem and the second term is the sub-Gaussian error rate. Our algorithmic approach employs an iterative refinement strategy, whereby we progressively learn more accurate approximations $\hat{\mu}$ to $\mu$. This is achieved via a novel rejection sampling procedure that removes points significantly deviating from $\hat{\mu}$, as an attempt to filter out unusually noisy samples. A complication that arises is that rejection sampling introduces bias in the distribution of the remaining points. To address this issue, we perform a careful analysis of the bias, develop an iterative dimension-reduction strategy, and employ a novel subroutine inspired by list-decodable learning that leverages the one-dimensional result.
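To make the setup concrete, here is a minimal sketch of a toy instance of the subset-of-signals model together with a single filtering step around a current estimate. Several parts are assumptions made only for illustration: the covariances are taken to be spherical, the noisy points share one large scale `noisy_scale`, the warm start is a coordinate-wise median rather than the tournament used in the paper, and the filter is a hard distance threshold rather than the paper's rejection sampling procedure. All function names are hypothetical.

```python
import math
import random
import statistics


def sample_subset_of_signals(n, d, alpha, mu, noisy_scale=50.0, seed=0):
    """Draw n points in R^d with common mean mu (toy instance).

    Assumption: covariances are sigma_i^2 * I. A ceil(alpha * n) share of the
    points has sigma_i = 1 (identity-bounded); the rest use noisy_scale.
    """
    rng = random.Random(seed)
    n_signal = math.ceil(alpha * n)
    sigmas = [1.0] * n_signal + [noisy_scale] * (n - n_signal)
    rng.shuffle(sigmas)  # the estimator does not know which points are signals
    points = [[rng.gauss(m, s) for m in mu] for s in sigmas]
    return points, sigmas


def coordinatewise_median(points):
    """Crude warm start; a stand-in for the paper's tournament step."""
    d = len(points[0])
    return [statistics.median(x[j] for x in points) for j in range(d)]


def filter_near(points, mu_hat, radius):
    """Keep the points within Euclidean distance `radius` of mu_hat."""
    return [x for x in points if math.dist(x, mu_hat) <= radius]


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    d = len(points[0])
    return [sum(x[j] for x in points) / len(points) for j in range(d)]


# Usage: warm start, drop far-away points, then average the survivors.
d = 3
points, _ = sample_subset_of_signals(n=200, d=d, alpha=0.25, mu=[0.0] * d)
mu_hat = coordinatewise_median(points)
kept = filter_near(points, mu_hat, radius=3.0 * math.sqrt(d))
refined = mean(kept) if kept else mu_hat
```

Averaging only the surviving points is exactly where the bias mentioned above enters: the kept points are no longer Gaussian around $\mu$, since the filter is centered at $\hat{\mu}$ rather than at $\mu$. This sketch makes no attempt to correct for that.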

Paper Structure (33 sections, 25 theorems, 98 equations, 4 algorithms)

Introduction
Main Result
Brief Overview of Techniques
Related Work
Additional Related Work on Entangled Mean Estimation
Comparison of Optimal Error in Spherical vs Arbitrary Gaussians
Further Related Work
Robust Statistics
Other models of semi-oblivious adversaries
The Dimensionality Reduction Algorithm and Proof Roadmap
Warm-start Estimate via Tournament
Rejection Sampling
Distribution of Accepted Samples
Rejection of Noisy Samples
Survival of Samples with Bounded Covariance
...and 18 more sections

Key Result

Theorem 1.1

The algorithm EntangledMeanEstimation (alg:mean_estimation) satisfies the following guarantee: The algorithm draws $N$ samples in $\mathbb R^D$ from the subset-of-signals model of def:model with common mean $\mu \in \mathbb R^D$ and signal-to-noise rate $\alpha \in (0,1)$. If $N \geq \tfrac{D}{\alpha} \log^C(\ldots)$ … where $f(\cdot)$ is the function defined in eq:simple-f-def. Moreover, the algorithm runs in time …

Theorems & Definitions (54)

Definition 1.0: Subset-of-Signals Model For High-Dimensional Gaussians
Theorem 1.1: High-Dimensional Entangled Mean Estimation
Lemma 2.1: Iterative Refinement (Informal; see lem:single_stage_analysis)
Lemma 2.2: Tournament (Informal; see lem:prune)
Definition 2.3: Generation of accepted samples (alternative view)
Lemma 2.4: Low-Variance Subspace Identification (Informal; see lem:low-var-identification)
Remark 3.4: Covariance Eigenvalue Lower Bound
Definition 3.4: Data-generation model; independent batches
Lemma 3.4
Theorem 3.5: compton2024near
...and 44 more
