Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression

Rares-Darius Buhai; Jingqiu Ding; Stefan Tiegel

TL;DR

The paper investigates computational-statistical gaps for improper learning in sparse linear regression with Gaussian random design, showing that any polynomial-time improper learner likely requires $Ω(k^2)$ samples, far above the information-theoretic $Θ(k \log(d/k))$ samples. The authors build a chain of reductions from a negative-spike sparse PCA (Wishart) problem to sparse linear regression, and bolster the argument with low-degree and statistical-query lower bounds. To handle variance issues, they introduce a paired sparse PCA framework that enforces symmetric noise, enabling a robust reduction to the training-error regime. The results highlight a fundamental $k$-to-$k^2$ gap in the random-design setting and situate the findings among concurrent hardness results, while contrasting with known efficient algorithms under restrictive design conditions such as RE or RIP. Overall, the work advances our understanding of when computational constraints prevent achieving information-theoretic performance in improper sparse learning tasks.
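To get a feel for the size of the $k$-to-$k^2$ gap, the two sample-complexity scalings can be compared numerically. The following is only an illustrative sketch: the function names are hypothetical, and the constants hidden in $Θ(\cdot)$ and $Ω(\cdot)$ are set to 1, which is an assumption and not something stated in the paper.

```python
import math


def info_theoretic_samples(k: int, d: int) -> float:
    """k * log(d / k), with the hidden constant set to 1 (an assumption)."""
    return k * math.log(d / k)


def conjectured_efficient_samples(k: int) -> int:
    """k^2, again with the hidden constant set to 1 (an assumption)."""
    return k ** 2


if __name__ == "__main__":
    # Parameter choices are arbitrary; each satisfies k^2 < d, so k^2 is
    # the smaller of the two quantities d and k^2.
    for k, d in [(10, 10_000), (100, 1_000_000), (1_000, 100_000_000)]:
        stat = info_theoretic_samples(k, d)
        comp = conjectured_efficient_samples(k)
        print(f"k={k:>5} d={d:>11}  k*log(d/k)={stat:>10.0f}  "
              f"k^2={comp:>9}  ratio={comp / stat:.1f}")
```

The ratio between the two columns grows roughly like $k / \log(d/k)$, which is the gap the hardness evidence is about.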

Abstract

We study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $Θ(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using less than $Θ(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $Ω(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $Ω(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.
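The sparse PCA problem in Wishart form that the reduction starts from can be illustrated with a small sampler. This is a minimal sketch under assumptions that are not spelled out in the abstract: it uses the common parameterization in which samples are drawn from $N(0, I + \beta v v^\top)$ with a $k$-sparse unit vector $v$ and spike strength $-1 \leqslant \beta < 0$, and all function names are hypothetical.

```python
import math
import random


def sparse_unit_vector(d, k, rng):
    """A k-sparse unit vector with entries +-1/sqrt(k) on a random support."""
    v = [0.0] * d
    for j in rng.sample(range(d), k):
        v[j] = rng.choice([-1.0, 1.0]) / math.sqrt(k)
    return v


def sample_negative_spike_wishart(n, v, beta, rng):
    """Draw n samples from N(0, I + beta * v v^T), assuming ||v|| = 1.

    Uses x = z + c * <v, z> * v with c = sqrt(1 + beta) - 1, because
    (I + c v v^T)^2 = I + beta v v^T for a unit vector v.
    """
    if not -1.0 <= beta < 0.0:
        raise ValueError("a negative spike needs -1 <= beta < 0")
    c = math.sqrt(1.0 + beta) - 1.0
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in v]
        proj = sum(zi * vi for zi, vi in zip(z, v))
        samples.append([zi + c * proj * vi for zi, vi in zip(z, v)])
    return samples


def variance_along(samples, v):
    """Empirical second moment of the samples in direction v."""
    projections = [sum(xi * vi for xi, vi in zip(x, v)) for x in samples]
    return sum(p * p for p in projections) / len(projections)


if __name__ == "__main__":
    rng = random.Random(0)
    v = sparse_unit_vector(d=20, k=4, rng=rng)
    xs = sample_negative_spike_wishart(n=4000, v=v, beta=-0.5, rng=rng)
    # Should be close to 1 + beta = 0.5, i.e. less variance than in
    # directions orthogonal to v.
    print(variance_along(xs, v))
```

A negative spike shrinks the variance along the hidden sparse direction instead of inflating it, so the planted direction shows up as an unusually small eigenvalue of the covariance.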

Paper Structure (20 sections, 10 theorems, 26 equations)

Introduction
Known lower bounds and our approach
Results
Relation to known algorithms and other hardness results
Concurrent work
Technical overview
Notation
Hardness of certifying RIP and a first lower bound
Extension to non-degenerate negative sparse PCA
Known variance of the noise
Sparse linear regression reduction
The reduction
Distinguishing paired distributions reduction
Concentration bounds
Low-degree lower bound for negative-spike sparse Wishart model
...and 5 more sections

Key Result

Theorem 1.5

Let $d,n,k \in \mathbb{N}$ with $d \geqslant k$ and let $0 < \delta \leqslant 0.1$ be an arbitrary absolute constant. Suppose that $n = o(\min(d,k^{2-\delta}))$. If there is an improper learner for the Sparse Linear Regression Model with Gaussian Design (cf. def:SLR_model_gauss_design) that uses …
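As a concrete picture of the learning task the theorem refers to, the sketch below draws data from a sparse linear model with a correlated Gaussian design and evaluates an estimate on the $n$ training samples. Several choices here are assumptions made for illustration only: the covariance $\Sigma_{ij} = \rho^{|i-j|}$ (the paper allows an arbitrary unknown covariance), the $\pm 1$ signal entries, the noise level, the in-sample error $\frac{1}{n}\lVert X(\hat\beta - \beta^*)\rVert^2$ as the notion of prediction error, and all function names.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sample_sparse_regression(n, d, k, rho=0.5, noise_std=1.0, seed=0):
    """Draw (X, y, beta) with y_i = <x_i, beta> + noise and beta k-sparse.

    Covariates are mean-zero Gaussian with covariance rho^|i-j|, generated
    by an AR(1) recursion (a hypothetical choice of correlated design).
    """
    rng = random.Random(seed)
    beta = [0.0] * d
    for j in rng.sample(range(d), k):
        beta[j] = rng.choice([-1.0, 1.0])
    X = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(d - 1):
            x.append(rho * x[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        X.append(x)
    y = [dot(x, beta) + rng.gauss(0.0, noise_std) for x in X]
    return X, y, beta


def in_sample_prediction_error(X, beta, estimate):
    """(1/n) * ||X (estimate - beta)||^2 on the n training samples."""
    return sum((dot(x, estimate) - dot(x, beta)) ** 2 for x in X) / len(X)


if __name__ == "__main__":
    X, y, beta = sample_sparse_regression(n=30, d=50, k=5)
    zero_estimate = [0.0] * 50  # the trivial baseline a learner has to beat
    print(in_sample_prediction_error(X, beta, zero_estimate))
```

An improper learner may return any dense vector as its estimate; only its prediction error matters, not whether the estimate is itself $k$-sparse.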

Theorems & Definitions (21)

Theorem 1.5: Main result, see reduction in \ref{thm:main_known_no_RE}
Definition 2.2: Paired spiked Wishart model
Theorem 3.1
Lemma 3.2
Proof of \ref{thm:main_known_no_RE}
Lemma 4.1
proof
Definition B.1: Definition 1.14 in [kunisky2019notes]
Proposition B.2: [hopkins2017efficient, HopkinsThesis, kunisky2019notes]
Conjecture B.3: Informal, [HopkinsThesis, kunisky2019notes]
...and 11 more
