Sample-Efficient Linear Regression with Self-Selection Bias

Jason Gaitonde; Elchanan Mossel

TL;DR

This work tackles linear regression with self-selection bias in the unknown-index setting, where the observed response is $z=\max_{j\in[k]}\{\mathbf{x}^T\mathbf{w}_j+\eta_j\}$ and the maximizing index is hidden. The authors develop a novel algorithm, near-optimally sample-efficient in $k$, that combines a spectral subspace recovery step with a net-based, moment-driven pruning procedure; the pruning uses low-degree moments conditioned on moderately likely events to distinguish close from far regressors. They prove structural results on the geometry and moments of regressors (including a two-threshold separation) and show a sample complexity of $\tilde{O}(n)\cdot\mathrm{poly}(k,1/\varepsilon)$ with running time $\mathrm{poly}(n,k,1/\varepsilon)+O((\log k)/\varepsilon)^{O(k)}$, improving over the prior STOC 2023 results of Cherapanamjeri et al. and handling relaxed noise assumptions. For $k=O(1)$, the algorithm runs in time polynomial in $n$ and $1/\varepsilon$. The guarantees extend to max-linear regression, where the output can serve as a warm start for local-convergence methods.
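To make the "spectral subspace recovery" idea concrete, here is a minimal sketch of one standard estimator of this kind. It is an illustration under stated assumptions, not the paper's exact procedure: for Gaussian covariates, Stein's identity implies that the matrix $\mathbb{E}[z^2(\mathbf{x}\mathbf{x}^T-I_n)]$ has its range inside $\mathrm{span}(\mathbf{w}_1,\ldots,\mathbf{w}_k)$, so its top eigenvectors give a candidate subspace. The function names, the test regressors, and the i.i.d. Gaussian noise with standard deviation `noise_std` are hypothetical choices made for this example.

```python
import numpy as np

def sample_self_selection(W, m, noise_std, rng):
    """Draw m samples (x, z) with z = max_j (x^T w_j + eta_j); the argmax is discarded."""
    k, n = W.shape
    X = rng.standard_normal((m, n))                 # x ~ N(0, I_n)
    eta = noise_std * rng.standard_normal((m, k))   # assumed noise: centered, independent of x
    z = (X @ W.T + eta).max(axis=1)
    return X, z

def spectral_subspace(X, z, k):
    """Top-k eigenvectors (by absolute eigenvalue) of the empirical E[z^2 (x x^T - I)]."""
    m, n = X.shape
    z2 = z ** 2
    M = (X * z2[:, None]).T @ X / m - z2.mean() * np.eye(n)
    vals, vecs = np.linalg.eigh(M)                  # M is symmetric
    top = np.argsort(-np.abs(vals))[:k]
    return vecs[:, top]                             # n x k orthonormal basis

rng = np.random.default_rng(0)
n, k, m = 10, 2, 200_000
W = np.zeros((k, n))
W[0, 0] = 2.0                                       # hypothetical regressors for the demo
W[1, 1] = 2.0
X, z = sample_self_selection(W, m, noise_std=0.1, rng=rng)
U = spectral_subspace(X, z, k)

# Part of each w_j that falls outside the estimated subspace (Frobenius norm).
residual = np.linalg.norm(W - (W @ U) @ U.T)
print(f"residual outside estimated subspace: {residual:.3f}")
```

Projecting onto a $k$-dimensional subspace first is what lets a subsequent net-based search range over a space whose size depends on $k$ and $\varepsilon$ rather than on the ambient dimension $n$.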

Abstract

We consider the problem of linear regression with self-selection bias in the unknown-index setting, as introduced in recent work by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [STOC 2023]. In this model, one observes $m$ i.i.d. samples $(\mathbf{x}_{\ell},z_{\ell})_{\ell=1}^m$ where $z_{\ell}=\max_{i\in [k]}\{\mathbf{x}_{\ell}^T\mathbf{w}_i+\eta_{i,\ell}\}$, but the maximizing index $i_{\ell}$ is unobserved. Here, the $\mathbf{x}_{\ell}$ are assumed to be $\mathcal{N}(0,I_n)$ and the noise distribution $\boldsymbol{\eta}_{\ell}\sim \mathcal{D}$ is centered and independent of $\mathbf{x}_{\ell}$. We provide a novel and near-optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1,\ldots,\mathbf{w}_k\in \mathbb{R}^n$ up to additive $\ell_2$-error $\varepsilon$ with polynomial sample complexity $\tilde{O}(n)\cdot \mathsf{poly}(k,1/\varepsilon)$ and significantly improved time complexity $\mathsf{poly}(n,k,1/\varepsilon)+O(\log(k)/\varepsilon)^{O(k)}$. When $k=O(1)$, our algorithm runs in $\mathsf{poly}(n,1/\varepsilon)$ time, generalizing the polynomial guarantee of an explicit moment matching algorithm of Cherapanamjeri et al. for $k=2$ and when it is known that $\mathcal{D}=\mathcal{N}(0,I_k)$. Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression where the added noise is taken outside the maximum. For this problem, our algorithm is efficient in a much larger range of $k$ than the state-of-the-art due to Ghosh, Pananjady, Guntuboyina, and Ramchandran [IEEE Trans. Inf. Theory 2022] for not too small $\varepsilon$, and leads to improved algorithms for any $\varepsilon$ by providing a warm start for existing local convergence methods.
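The link between the two models can be checked numerically. The sketch below samples from the self-selection model and from max-linear regression, and illustrates that noise added outside the maximum is the same as a noise vector whose $k$ coordinates are all equal, since $\max_i\{\mathbf{x}^T\mathbf{w}_i\}+\xi=\max_i\{\mathbf{x}^T\mathbf{w}_i+\xi\}$. The function names, dimensions, and the Gaussian choice for $\xi$ are assumptions made for illustration only.

```python
import numpy as np

def self_selection_response(X, W, eta):
    """z_l = max_i (x_l^T w_i + eta_{i,l}); eta has shape (m, k)."""
    return (X @ W.T + eta).max(axis=1)

def max_linear_response(X, W, xi):
    """z_l = max_i (x_l^T w_i) + xi_l; xi has shape (m,)."""
    return (X @ W.T).max(axis=1) + xi

rng = np.random.default_rng(1)
m, n, k = 1000, 5, 3                      # hypothetical sizes for the demo
X = rng.standard_normal((m, n))           # x_l ~ N(0, I_n)
W = rng.standard_normal((k, n))           # rows are the regressors w_1, ..., w_k
xi = 0.5 * rng.standard_normal(m)         # centered noise, independent of X

# Noise outside the max corresponds to a fully correlated noise vector
# eta_l = (xi_l, ..., xi_l) inside the max.
eta_shared = np.repeat(xi[:, None], k, axis=1)
z_ss = self_selection_response(X, W, eta_shared)
z_ml = max_linear_response(X, W, xi)
print(np.allclose(z_ss, z_ml))
```

Such a noise vector has perfectly correlated coordinates, so it is not covered by an assumption like $\mathcal{D}=\mathcal{N}(0,I_k)$; this is why the relaxed noise assumptions matter for the max-linear setting.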

Paper Structure

This paper contains 17 sections, 39 theorems, 163 equations, and 1 algorithm.

Introduction
Problem Formulation
Our Contributions
Related Work
Overview of Techniques
Preliminaries
Linear Algebra
Subgaussian and Subexponential Random Variables
Concentration Bounds
The Geometry and Moments of Regressors
The Geometry of Near Vectors
The Geometry of Far Vectors
Distinguishing Close and Far Vectors
Finding an Approximate Subspace
Complete Algorithm and Sample Complexity Analysis
...and 2 more sections

Key Result

Theorem 1.1

Under assumption:uncovered and assumption:bounded with fixed $B,\Delta>0$, for any $\varepsilon<\mathsf{poly}(\Delta,1/B)$ and $\lambda\in (0,1)$, there exists an algorithm for eq:lin_reg_ss that outputs $\widetilde{\mathbf{w}}_1,\ldots,\widetilde{\mathbf{w}}_k\in \mathbb{R}^n$ satisfying $\max_{i\in [k]}$ …

Theorems & Definitions (69)

Theorem 1.1 (informal; see thm:final_statement)
Corollary 1.2
Corollary 1.3
Proposition 2.1 (restated; see prop:close_maximal)
Corollary 2.2 (informal; see cor:conditional_moments)
Proposition 2.3 (informal; see lem:no_large_proj, lem:no_two_close, lem:not_all_neg)
Proposition 2.4 (informal; see cor:inflated_sm)
Lemma 3.1
proof
Lemma 3.2
...and 59 more
