Entangled Mean Estimation in High-Dimensions

Ilias Diakonikolas, Daniel M. Kane, Sihan Liu, Thanasis Pittas

TL;DR

This work resolves the challenging problem of high-dimensional entangled mean estimation under the subset-of-signals model by delivering a computationally efficient algorithm that nearly matches the information-theoretic limit. It combines a warm-start tournament, rejection sampling to filter noisy samples, and a recursive dimensionality-reduction scheme that identifies low-variance subspaces and progressively refines the mean estimate. The main result shows that the estimation error decomposes into a one-dimensional term $f(\alpha,N)$ and a sub-Gaussian term $\sqrt{D/(\alpha N)}$, up to polylogarithmic factors, and that the algorithm runs in polynomial time in $D$ and $N$ provided $N \ge \widetilde{\Omega}(D/\alpha)$. This advances our understanding of multivariate entangled mean estimation by matching the 1D lower bounds up to polylog factors and paves the way for robust, scalable estimation in heterogeneous Gaussian settings with bounded subset covariances.
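To get a feel for the sub-Gaussian term, the minimal Python sketch below evaluates $\sqrt{D/(\alpha N)}$ and checks the sample-size condition with constants and polylogarithmic factors dropped. The function names and the example values are illustrative assumptions, not taken from the paper; the one-dimensional term $f(\alpha,N)$ is left out because its definition does not appear in this summary.

```python
import math


def sub_gaussian_term(dim: int, alpha: float, n: int) -> float:
    """Return sqrt(D / (alpha * N)), ignoring constants and polylog factors."""
    if not (0.0 < alpha < 1.0):
        raise ValueError("alpha must lie in (0, 1)")
    return math.sqrt(dim / (alpha * n))


def meets_sample_condition(dim: int, alpha: float, n: int) -> bool:
    """Check N >= D / alpha only; the polylog factor is dropped (assumption)."""
    return n >= dim / alpha


# Hypothetical values: D = 100, alpha = 0.1, N = 100000.
rate = sub_gaussian_term(dim=100, alpha=0.1, n=100_000)
print(f"sub-Gaussian term: {rate:.3f}")
print("sample condition met:", meets_sample_condition(100, 0.1, 100_000))
```

Because the term scales as $1/\sqrt{N}$, quadrupling the number of samples halves it, while halving $\alpha$ inflates it by a factor of $\sqrt{2}$.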

Abstract

We study the task of high-dimensional entangled mean estimation in the subset-of-signals model. Specifically, given $N$ independent random points $x_1,\ldots,x_N$ in $\mathbb{R}^D$ and a parameter $\alpha \in (0, 1)$ such that each $x_i$ is drawn from a Gaussian with mean $\mu$ and unknown covariance, and an unknown $\alpha$-fraction of the points have identity-bounded covariances, the goal is to estimate the common mean $\mu$. The one-dimensional version of this task has received significant attention in theoretical computer science and statistics over the past decades. Recent work [LY20; CV24] has given near-optimal upper and lower bounds for the one-dimensional setting. On the other hand, our understanding of even the information-theoretic aspects of the multivariate setting has remained limited. In this work, we design a computationally efficient algorithm achieving an information-theoretically near-optimal error. Specifically, we show that the optimal error (up to polylogarithmic factors) is $f(\alpha,N) + \sqrt{D/(\alpha N)}$, where the term $f(\alpha,N)$ is the error of the one-dimensional problem and the second term is the sub-Gaussian error rate. Our algorithmic approach employs an iterative refinement strategy, whereby we progressively learn more accurate approximations $\hat \mu$ to $\mu$. This is achieved via a novel rejection sampling procedure that removes points significantly deviating from $\hat \mu$, as an attempt to filter out unusually noisy samples. A complication that arises is that rejection sampling introduces bias in the distribution of the remaining points. To address this issue, we perform a careful analysis of the bias, develop an iterative dimension-reduction strategy, and employ a novel subroutine inspired by list-decodable learning that leverages the one-dimensional result.
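The data model and the filtering idea can be illustrated with a small, self-contained Python sketch. It is a toy under loudly stated assumptions, not the paper's algorithm: every covariance is isotropic, the signal points have covariance exactly $I$, the noisy points share one large scale (`noisy_scale`), the warm start is a coordinate-wise median rather than a tournament, and the rejection step is a hard distance threshold. All function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_subset_of_signals(n, mu, alpha, noisy_scale=50.0, seed=0):
    """Draw n Gaussian points with common mean mu.

    Assumption: ceil(alpha * n) points have covariance I and the rest have
    covariance noisy_scale^2 * I. The model itself allows arbitrary unknown
    covariances for the noisy points.
    """
    rng = random.Random(seed)
    n_signal = math.ceil(alpha * n)
    scales = [1.0] * n_signal + [noisy_scale] * (n - n_signal)
    rng.shuffle(scales)  # the estimator does not know which points are signal
    return [[m + s * rng.gauss(0.0, 1.0) for m in mu] for s in scales]


def coordinate_median(points):
    """Crude warm start (assumption; stands in for the tournament)."""
    return [statistics.median(col) for col in zip(*points)]


def reject_far_points(points, mu_hat, radius):
    """Keep only points within Euclidean distance `radius` of mu_hat."""
    return [x for x in points if math.dist(x, mu_hat) <= radius]


def empirical_mean(points):
    return [statistics.fmean(col) for col in zip(*points)]


mu = [3.0] * 5
points = sample_subset_of_signals(n=2000, mu=mu, alpha=0.2)
mu_hat = coordinate_median(points)
kept = reject_far_points(points, mu_hat, radius=5.0)
refined = empirical_mean(kept)

print("points kept:", len(kept))
print("error of plain mean:  ", math.dist(empirical_mean(points), mu))
print("error after filtering:", math.dist(refined, mu))
```

In this toy setting the filter is centered at an estimate that is already close to $\mu$, so the bias it introduces is small. The abstract's point is that in general the accepted points are no longer centered at $\mu$, which is why the paper needs a bias analysis and a dimension-reduction step on top of the filtering.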

Paper Structure

This paper contains 33 sections, 25 theorems, 98 equations, and 4 algorithms.

Key Result

Theorem 1.1

The algorithm EntangledMeanEstimation (alg:mean_estimation) satisfies the following guarantee: The algorithm draws $N$ samples in $\mathbb R^D$ from the subset-of-signals model of def:model with common mean $\mu \in \mathbb R^D$ and signal-to-noise rate $\alpha \in (0,1)$. If $N \geq \tfrac{D}{\alpha} \log^C(\ldots)$ [...] where $f(\cdot)$ is the function defined in eq:simple-f-def. Moreover, the algorithm runs in time [...]

Theorems & Definitions (54)

  • Definition 1.0: Subset-of-Signals Model For High-Dimensional Gaussians
  • Theorem 1.1: High-Dimensional Entangled Mean Estimation
  • Lemma 2.1: Iterative Refinement (Informal; see lem:single_stage_analysis)
  • Lemma 2.2: Tournament (Informal; see lem:prune)
  • Definition 2.3: Generation of accepted samples (alternative view)
  • Lemma 2.4: Low-Variance Subspace Identification (Informal; see lem:low-var-identification)
  • Remark 3.4: Covariance Eigenvalue Lower Bound
  • Definition 3.4: Data-generation model; independent batches
  • Lemma 3.4
  • Theorem 3.5: compton2024near
  • ...and 44 more