Sample-Efficient Linear Regression with Self-Selection Bias
Jason Gaitonde, Elchanan Mossel
TL;DR
This work tackles linear regression with self-selection bias in the unknown-index setting, where the observed response is $z=\max_{j\in[k]}\{\mathbf{x}^T\mathbf{w}_j+\eta_j\}$ and the maximizing index is hidden. The authors develop a novel algorithm that is near-optimally sample-efficient in $k$. It combines a spectral subspace recovery step with a net-based, moment-driven pruning procedure, using low-degree moments conditioned on moderately likely events to distinguish close from far regressors. They prove a set of structural results (including a two-threshold separation) and show a sample complexity of $\tilde{O}(n)\cdot\mathrm{poly}(k,1/\varepsilon)$ with running time $\mathrm{poly}(n,k,1/\varepsilon)+O((\log k)/\varepsilon)^{O(k)}$, improving over the prior STOC 2023 results and handling relaxed noise assumptions. For $k=O(1)$, the algorithm runs in time polynomial in $n$ and $1/\varepsilon$. The relaxed noise assumptions also cover max-linear regression, where the algorithm can serve as a warm start for local-convergence methods.
Abstract
We consider the problem of linear regression with self-selection bias in the unknown-index setting, as introduced in recent work by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [STOC 2023]. In this model, one observes $m$ i.i.d. samples $(\mathbf{x}_{\ell},z_{\ell})_{\ell=1}^m$ where $z_{\ell}=\max_{i\in [k]}\{\mathbf{x}_{\ell}^T\mathbf{w}_i+\eta_{i,\ell}\}$, but the maximizing index $i_{\ell}$ is unobserved. Here, the $\mathbf{x}_{\ell}$ are assumed to be $\mathcal{N}(0,I_n)$ and the noise distribution $\boldsymbol{\eta}_{\ell}\sim \mathcal{D}$ is centered and independent of $\mathbf{x}_{\ell}$. We provide a novel and near-optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1,\ldots,\mathbf{w}_k\in \mathbb{R}^n$ up to additive $\ell_2$-error $\varepsilon$ with polynomial sample complexity $\tilde{O}(n)\cdot \mathsf{poly}(k,1/\varepsilon)$ and significantly improved time complexity $\mathsf{poly}(n,k,1/\varepsilon)+O(\log(k)/\varepsilon)^{O(k)}$. When $k=O(1)$, our algorithm runs in $\mathsf{poly}(n,1/\varepsilon)$ time, generalizing the polynomial guarantee of an explicit moment matching algorithm of Cherapanamjeri et al. for $k=2$ and when it is known that $\mathcal{D}=\mathcal{N}(0,I_k)$. Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression where the added noise is taken outside the maximum. For this problem, our algorithm is efficient in a much larger range of $k$ than the state-of-the-art due to Ghosh, Pananjady, Guntuboyina, and Ramchandran [IEEE Trans. Inf. Theory 2022] for not too small $\varepsilon$, and leads to improved algorithms for any $\varepsilon$ by providing a warm start for existing local convergence methods.
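To make the observation model concrete, the following is a minimal sketch of how samples could be simulated, not the authors' code. It assumes Gaussian noise for illustration (the abstract only requires $\mathcal{D}$ to be centered and independent of $\mathbf{x}_{\ell}$); the function names and the `noise_scale` parameter are hypothetical. The second function draws from the max-linear model, which coincides with the self-selection model when every coordinate of the noise vector is the same scalar, since $\max_i\{\mathbf{x}^T\mathbf{w}_i+\xi\}=\max_i\{\mathbf{x}^T\mathbf{w}_i\}+\xi$.

```python
import numpy as np

def sample_self_selection(W, m, rng, noise_scale=1.0):
    """Draw m samples with z = max_i (x^T w_i + eta_i).

    W has shape (k, n), one regressor per row. Illustrative assumption:
    eta ~ N(0, noise_scale^2 * I_k). The maximizing index is returned only
    so a simulation can inspect it; a learner never observes it.
    """
    k, n = W.shape
    X = rng.standard_normal((m, n))                  # covariates x ~ N(0, I_n)
    eta = noise_scale * rng.standard_normal((m, k))  # noise inside the max
    scores = X @ W.T + eta                           # shape (m, k)
    z = scores.max(axis=1)                           # observed response
    hidden_idx = scores.argmax(axis=1)               # unobserved index
    return X, z, hidden_idx

def sample_max_linear(W, m, rng, noise_scale=1.0):
    """Draw m samples with z = max_i (x^T w_i) + xi, noise outside the max."""
    k, n = W.shape
    X = rng.standard_normal((m, n))
    xi = noise_scale * rng.standard_normal(m)        # one scalar noise per sample
    z = (X @ W.T).max(axis=1) + xi
    return X, z

# Example usage with k = 3 regressors in dimension n = 5.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
X, z, hidden_idx = sample_self_selection(W, m=1000, rng=rng)
X_ml, z_ml = sample_max_linear(W, m=1000, rng=rng)
```

A recovery algorithm in this setting would receive only `X` and `z`; `hidden_idx` is exactly the information that the unknown-index setting withholds.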
