
Sample-Efficient Linear Regression with Self-Selection Bias

Jason Gaitonde, Elchanan Mossel

TL;DR

This work tackles linear regression with self-selection bias in the unknown-index setting, where the observed response is $z=\max_{j\in[k]}\{\mathbf{x}^T\mathbf{w}_j+\eta_j\}$ and the maximizing index is hidden. The authors develop a novel, near-optimally sample-efficient algorithm that combines a spectral subspace recovery step with a net-based, moment-driven pruning procedure, using low-degree moments conditioned on moderately likely events to distinguish close from far regressors. They prove a set of structural results (including a two-threshold separation) and show a sample complexity of $\tilde{O}(n)\cdot\mathrm{poly}(k,1/\varepsilon)$ with running time $\mathrm{poly}(n,k,1/\varepsilon)+O((\log k)/\varepsilon)^{O(k)}$, improving over the prior STOC 2023 results and handling relaxed noise assumptions. For constant $k$, the algorithm runs in time polynomial in $n$ and $1/\varepsilon$. The relaxed noise assumptions also cover max-linear regression, where the output can serve as a warm start for local-convergence methods.

Abstract

We consider the problem of linear regression with self-selection bias in the unknown-index setting, as introduced in recent work by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [STOC 2023]. In this model, one observes $m$ i.i.d. samples $(\mathbf{x}_{\ell},z_{\ell})_{\ell=1}^m$ where $z_{\ell}=\max_{i\in [k]}\{\mathbf{x}_{\ell}^T\mathbf{w}_i+\eta_{i,\ell}\}$, but the maximizing index $i_{\ell}$ is unobserved. Here, the $\mathbf{x}_{\ell}$ are assumed to be $\mathcal{N}(0,I_n)$ and the noise distribution $\boldsymbol{\eta}_{\ell}\sim \mathcal{D}$ is centered and independent of $\mathbf{x}_{\ell}$. We provide a novel and near-optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1,\ldots,\mathbf{w}_k\in \mathbb{R}^n$ up to additive $\ell_2$-error $\varepsilon$ with polynomial sample complexity $\tilde{O}(n)\cdot \mathsf{poly}(k,1/\varepsilon)$ and significantly improved time complexity $\mathsf{poly}(n,k,1/\varepsilon)+O(\log(k)/\varepsilon)^{O(k)}$. When $k=O(1)$, our algorithm runs in $\mathsf{poly}(n,1/\varepsilon)$ time, generalizing the polynomial guarantee of an explicit moment matching algorithm of Cherapanamjeri et al. for $k=2$ and when it is known that $\mathcal{D}=\mathcal{N}(0,I_k)$. Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression where the added noise is taken outside the maximum. For this problem, our algorithm is efficient in a much larger range of $k$ than the state-of-the-art due to Ghosh, Pananjady, Guntuboyina, and Ramchandran [IEEE Trans. Inf. Theory 2022] for not too small $\varepsilon$, and leads to improved algorithms for any $\varepsilon$ by providing a warm start for existing local convergence methods.
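To make the observation model concrete, the following is a minimal NumPy sketch of how samples $(\mathbf{x}_{\ell},z_{\ell})$ could be generated. It is an illustration only, not the paper's algorithm: the function name `sample_self_selection`, the `noise_std` parameter, and the choice of i.i.d. Gaussian noise are assumptions made here (the abstract only requires $\mathcal{D}$ to be centered and independent of $\mathbf{x}_{\ell}$).

```python
import numpy as np

def sample_self_selection(W, m, noise_std=1.0, seed=0):
    """Draw m samples with z = max_i (x^T w_i + eta_i).

    W is a (k, n) array whose rows are the regressors w_1, ..., w_k.
    Gaussian noise is an assumption of this sketch, not of the paper.
    """
    rng = np.random.default_rng(seed)
    k, n = W.shape
    X = rng.standard_normal((m, n))                 # x_l ~ N(0, I_n)
    eta = noise_std * rng.standard_normal((m, k))   # centered, independent of X
    scores = X @ W.T + eta                          # entry (l, i) is x_l^T w_i + eta_{i,l}
    idx = scores.argmax(axis=1)                     # maximizing index i_l
    z = scores[np.arange(m), idx]                   # observed response z_l
    return X, z, idx

# Hypothetical example with k = 2 regressors in dimension n = 3.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
X, z, idx = sample_self_selection(W, m=1000)
```

A learner in the unknown-index setting would receive only `X` and `z`; `idx` is returned here purely so the simulation can be inspected.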

Paper Structure

This paper contains 17 sections, 39 theorems, 163 equations, and 1 algorithm.

Key Result

Theorem 1.1

Under assumption:uncovered and assumption:bounded with fixed $B,\Delta>0$, for any $\varepsilon<\mathsf{poly}(\Delta,1/B)$ and $\lambda\in (0,1)$, there exists an algorithm for eq:lin_reg_ss that outputs $\widetilde{\bm{w}_1},\ldots,\widetilde{\bm{w}_k}\in \mathbb{R}^n$ satisfying $\max_{i\in [k]} \ldots$

Theorems & Definitions (69)

  • Theorem 1.1 (informal; see thm:final_statement)
  • Corollary 1.2
  • Corollary 1.3
  • Proposition 2.1 (restated; see prop:close_maximal)
  • Corollary 2.2 (informal; see cor:conditional_moments)
  • Proposition 2.3 (informal; see lem:no_large_proj, lem:no_two_close, lem:not_all_neg)
  • Proposition 2.4 (informal; see cor:inflated_sm)
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • ...and 59 more