High-dimensional estimation with missing data: Statistical and computational limits

Kabir Aladin Verchand; Ankit Pensia; Saminul Haque; Rohith Kuditipudi

Abstract

We consider computationally efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data, in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence of statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally inefficient algorithm which achieves this error. On the other hand, we show that any computationally efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial-time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
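To make the contamination setting concrete, the following is a minimal simulation sketch, not the paper's construction. It assumes all-or-nothing missingness, standard Gaussian data with true mean zero, and one hypothetical MNAR mechanism (a contaminated sample is hidden whenever its first coordinate is positive). The remaining samples are treated as fully observed, which simplifies the model. The function name `simulate_mnar_bias` and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulate_mnar_bias(n=20000, d=5, eps=0.2, seed=0):
    """Return (mean of observed samples, fraction of samples hidden).

    Assumptions (not from the paper): data are N(0, I_d); an eps fraction of
    samples is governed by a hypothetical MNAR rule that hides the whole
    sample when its first coordinate is positive; all other samples are
    fully observed.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))       # true data, true mean is 0
    contaminated = rng.random(n) < eps    # samples subject to the MNAR rule
    missing = contaminated & (x[:, 0] > 0)  # hypothetical MNAR mechanism
    observed = x[~missing]                # all-or-nothing: drop hidden rows
    return observed.mean(axis=0), missing.mean()


if __name__ == "__main__":
    naive_mean, frac_missing = simulate_mnar_bias()
    print("fraction hidden:", frac_missing)
    print("naive mean of observed samples:", naive_mean)
```

Under these assumptions the naive mean of the observed samples is pulled below zero in the first coordinate while the other coordinates stay near zero, which illustrates why an unknown MNAR mechanism cannot simply be ignored.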

Paper Structure (92 sections, 53 theorems, 403 equations, 2 figures, 2 algorithms)

Introduction
Contributions
Mean estimation.
Covariance estimation.
Linear regression.
Further related work
Missing data.
Truncated statistics.
Robust estimation.
Background
Sum-of-squares algorithms
Quantifier elimination.
SQ lower bounds
Minimax lower bounds
Information-theoretic limits of mean and covariance estimation
...and 77 more sections

Key Result

Lemma 1.1

The distribution $Q \in \mathcal{R}(P, \epsilon, q)$ if and only if, for all $z \in \mathbf{R}^d$,

Figures (2)

Figure 1: Types of missing data patterns. Each row indicates a single sample, where the entries in gray indicate an observed value and the $\star$ entries indicate missingness. In order to simplify the results in the main text, we focus on the all-or-nothing patterns described in panel (a), deferring the extension to the more general setting to Appendix \ref{sec:multiple-patterns}.
Figure 2: Sample complexity phase diagram for mean estimation with $\epsilon$ a fixed constant. In order to achieve $\ell_2$ norm error $\rho$, it is information-theoretically necessary and sufficient to take $n \asymp de^{1/\rho^2}$ many samples. On the other hand, any statistical query algorithm must take (roughly) $d^{1/\rho^2}$ many samples and a polynomial time algorithm (nearly) saturates this lower bound.
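As a rough numerical illustration of the gap described in the Figure 2 caption, the sketch below evaluates the two rates $d e^{1/\rho^2}$ and $d^{1/\rho^2}$ with all constants and logarithmic factors dropped. The function names and the chosen values of $d$ and $\rho$ are assumptions made for illustration only, and the comparison is only meaningful for $\rho < 1$.

```python
import math


def info_theoretic_rate(d, rho):
    """Information-theoretic rate d * exp(1 / rho^2), constants ignored."""
    return d * math.exp(1.0 / rho**2)


def sq_rate(d, rho):
    """Statistical-query rate d^(1 / rho^2), constants ignored."""
    return d ** (1.0 / rho**2)


if __name__ == "__main__":
    d = 100
    for rho in (0.9, 0.7, 0.5):
        it = info_theoretic_rate(d, rho)
        sq = sq_rate(d, rho)
        print(f"rho={rho}: d*e^(1/rho^2)={it:.3g}, d^(1/rho^2)={sq:.3g}")
```

The first rate is linear in $d$ for fixed $\rho$, whereas the second is polynomial in $d$ with exponent $1/\rho^2$, so the ratio between them grows quickly as $\rho$ decreases or $d$ increases.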

Theorems & Definitions (106)

Lemma 1.1
Lemma 1.2
Definition 2.1
Definition 2.2
Definition 2.5: STAT Oracle
Definition 2.6: Generic Testing Problem
Definition 2.7: High-Dimensional Hidden Direction Distribution
Definition 2.8: NGCA
Lemma 2.9: SQ Lower Bounds for NGCA [DiaKS17]
Lemma 2.10: Fano's inequality
...and 96 more
