Sparse Linear Regression is Easy on Random Supports

Gautam Chandrasekaran, Raghu Meka, Konstantinos Stavropoulos

TL;DR

The paper tackles sparse linear regression where $y = X w^* + \xi$ with $w^*$ $k$-sparse, addressing the gap between statistical and computational requirements. It introduces a novel preconditioning framework that, for random supports, yields a polynomial-time algorithm with sample complexity polynomial in $k$, $\log \log \kappa(X)$, and $\log d$, achieving prediction error $\varepsilon$ even when $\kappa(X)$ is large. This is accomplished via Phase 1: constructing a good basis $(B,I)$ that makes the transformed design $Z = X B^T$ well-behaved for sparse regression; and Phase 2: solving a constrained regression on $Z$ (partial Lasso) to recover $\widehat w = B^T \widehat u$, with provable error guarantees. The results extend to random Gaussian designs without knowledge of $\Sigma$ and establish near-tight lower bounds for good preconditioners. Overall, the work shows exponential computational-statistical gaps can vanish under average-case assumptions on the support, advancing practical and theoretical understanding of high-dimensional sparse estimation.
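
The change of variables behind the two phases can be sketched in a few lines. The sketch below is only an illustration under stated assumptions. The basis `B` is a plain column-rescaling stand-in, not the paper's good basis $(B,I)$. Phase 2 uses an ordinary Lasso solved by proximal gradient (ISTA), not the paper's partial Lasso. All names, sizes, and the value of `lam` are hypothetical. What it does show is that predictions are unchanged by the reparametrisation, since $Z \widehat u = X B^T \widehat u = X \widehat w$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Z, y, lam, n_iter=500):
    """Minimise (1/2N)||y - Z u||_2^2 + lam * ||u||_1 by proximal gradient."""
    N, d = Z.shape
    step = N / (np.linalg.norm(Z, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    u = np.zeros(d)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ u - y) / N
        u = soft_threshold(u - step * grad, step * lam)
    return u

rng = np.random.default_rng(1)
N, d, k = 100, 30, 3                          # illustrative sizes (assumed)

# Badly scaled design: column norms span three orders of magnitude.
scales = 10.0 ** rng.uniform(0.0, 3.0, size=d)
X = rng.standard_normal((N, d)) * scales

# k-sparse signal on a uniformly random support.
w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = 1.0
y = X @ w_star + 0.05 * rng.standard_normal(N)

# "Phase 1" stand-in (NOT the paper's construction): a diagonal B that
# rescales every column of Z = X B^T to norm sqrt(N).
B = np.diag(np.sqrt(N) / np.linalg.norm(X, axis=0))
Z = X @ B.T

# "Phase 2" stand-in: plain Lasso on Z, then map back with w_hat = B^T u_hat.
u_hat = lasso_ista(Z, y, lam=0.05)
w_hat = B.T @ u_hat

err_hat = float(np.sum((X @ (w_star - w_hat)) ** 2)) / N
err_zero = float(np.sum((X @ w_star) ** 2)) / N   # error of predicting zero
```

Because `B` is diagonal here, sparsity of `u_hat` carries over to `w_hat`. A general basis would not preserve sparsity in this way, which is one reason the paper's construction of $(B,I)$ is the substantive step.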

Abstract

Sparse linear regression is one of the most basic questions in machine learning and statistics. Here, we are given as input a design matrix $X \in \mathbb{R}^{N \times d}$ and measurements or labels $y \in \mathbb{R}^N$ where $y = X w^* + \xi$, and $\xi$ is the noise in the measurements. Importantly, we have the additional constraint that the unknown signal vector $w^*$ is sparse: it has $k$ non-zero entries where $k$ is much smaller than the ambient dimension. Our goal is to output a prediction vector $\widehat{w}$ that has small prediction error: $\frac{1}{N}\cdot \|X w^* - X \widehat{w}\|^2_2$. Information-theoretically, we know what is best possible in terms of measurements: under most natural noise distributions, we can get prediction error at most $\varepsilon$ with roughly $N = O(k \log d/\varepsilon)$ samples. Computationally, this currently needs $d^{\Omega(k)}$ run-time. Alternately, with $N = O(d)$, we can get polynomial-time. Thus, there is an exponential gap (in the dependence on $d$) between the two and we do not know if it is possible to get $d^{o(k)}$ run-time and $o(d)$ samples. We give the first generic positive result for worst-case design matrices $X$: For any $X$, we show that if the support of $w^*$ is chosen at random, we can get prediction error $\varepsilon$ with $N = \text{poly}(k, \log d, 1/\varepsilon)$ samples and run-time $\text{poly}(d,N)$. This run-time holds for any design matrix $X$ with condition number up to $2^{\text{poly}(d)}$. Previously, such results were known for worst-case $w^*$, but only for random design matrices from well-behaved families, matrices that have a very low condition number ($\text{poly}(\log d)$; e.g., as studied in compressed sensing), or those with special structural properties.
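
To make the setup concrete, the following sketch builds one instance of the model $y = X w^* + \xi$ with a uniformly random support and evaluates the prediction error $\frac{1}{N}\|X w^* - X \widehat{w}\|_2^2$. The Gaussian design, Gaussian noise, sizes, and noise level are assumptions chosen for illustration only. Ordinary least squares serves as the polynomial-time baseline available once $N \geq d$; it is not the paper's algorithm.

```python
import numpy as np

def prediction_error(X, w_star, w_hat):
    """Return (1/N) * ||X w_star - X w_hat||_2^2."""
    N = X.shape[0]
    residual = X @ (w_star - w_hat)
    return float(residual @ residual) / N

rng = np.random.default_rng(0)
N, d, k = 200, 50, 3      # illustrative sizes (assumed), with N >= d
sigma = 0.1               # noise standard deviation (assumed)

# Design matrix and a k-sparse signal whose support is a uniformly
# random k-subset of the d coordinates.
X = rng.standard_normal((N, d))
support = rng.choice(d, size=k, replace=False)
w_star = np.zeros(d)
w_star[support] = rng.standard_normal(k)

# Noisy measurements y = X w* + xi.
y = X @ w_star + sigma * rng.standard_normal(N)

# Baseline: ordinary least squares, which ignores sparsity entirely.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

err_ols = prediction_error(X, w_star, w_ols)
err_zero = prediction_error(X, w_star, np.zeros(d))  # predicting zero
```

The baseline only works because `N >= d` here. The regime the abstract targets is $N = \text{poly}(k, \log d, 1/\varepsilon)$, far below $d$, where least squares is underdetermined.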

Paper Structure

This paper contains 28 sections, 18 theorems, 54 equations, 1 algorithm.

Key Result

Theorem 1.2

There exists a randomized algorithm $\mathcal{A}$ such that for any $\mathbf{X}$ satisfying $\max_{i\in [d]}\|\mathbf{X}^{(i)}\|_2\leq \sqrt{N}$, the following holds: if $S$ is drawn uniformly from $[d]$ with $|S|=k$, $\mathbf{w}^*$ is any vector supported on $S$ with $\|\mathbf{X}\mathbf{w}^*\|$ [...] for universal constant $C$. Furthermore, $\mathcal{A}(\mathbf{X},\mathbf{y})$ runs in time [...]

Theorems & Definitions (51)

  • Definition 1.1: $k$-sparse condition number of a dataset
  • Theorem 1.2
  • Remark 1.3
  • Definition 1.4
  • Theorem 1.5
  • Definition 3.1: Good Preconditioner
  • Theorem 3.2: Finding a good preconditioner
  • Remark 3.3: Lower bounds for good preconditioners
  • Theorem 3.4
  • Proof Sketch
  • ...and 41 more