Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis; Anay Mehrotra; Alkis Kalavasis; Constantine Caramanis

TL;DR

This work gives the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian.

Abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.
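To make the setting concrete, the following is a minimal one-dimensional simulation of the truncation model described in the abstract. It is a sketch under stated assumptions, not the paper's algorithm: the feature distribution (uniform on $[-1,1]$, which is bounded and hence sub-Gaussian), the unit-variance Gaussian noise, the value `w_star = 2.0`, and the survival set `S_star = [(0.0, 1.5)]` are all hypothetical choices made for illustration. It shows only that ordinary least squares, fitted to the surviving samples while ignoring truncation, recovers a slope far from $w^\star$.

```python
import random


def sample_truncated(n, w_star, survival_set, seed=0):
    """Draw n pairs (x, y) with y = w_star * x + noise, keeping a pair
    only when y falls inside one of the intervals of survival_set."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x = rng.uniform(-1.0, 1.0)            # assumed feature distribution
        y = w_star * x + rng.gauss(0.0, 1.0)  # assumed unit-variance noise
        if any(lo <= y <= hi for lo, hi in survival_set):
            kept.append((x, y))               # sample survives truncation
    return kept


def ols_slope(samples):
    """Slope of ordinary least squares with an intercept."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    sxx = sum((x - mx) ** 2 for x, _ in samples)
    return sxy / sxx


w_star = 2.0              # hypothetical true regressor
S_star = [(0.0, 1.5)]     # hypothetical survival set (a single interval)
samples = sample_truncated(5000, w_star, S_star)
naive_slope = ols_slope(samples)
print(f"true w* = {w_star}, naive OLS slope on truncated data = {naive_slope:.3f}")
```

Because every surviving outcome lies in a band of width 1.5, the fitted slope is strongly attenuated relative to $w^\star$, which is why estimators for this model must account for the survival set.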

Paper Structure (59 sections, 52 theorems, 228 equations, 3 figures, 5 algorithms)

Introduction
Prior Work.
Our Contributions.
Technical Overview.
Approach with known $S^\star$.
Existing approach with unknown $S^\star$.
Issue I: Properties and Optimization of ${\widetilde{\euscr{L}}}_S(\cdot)$ without Gaussianity.
Issue II: Efficiently learning $S^\star$ from positive samples only.
Related Works
Truncated Linear Regression.
Learning from Positive Examples.
Preliminaries and Model
Notation.
Sub-Gaussianity.
Our Results
...and 44 more sections

Key Result

Theorem 1.1

Suppose Assumptions \ref{asmp:massMain}, \ref{asmp:subGaussianMain}, and \ref{asmp:conservationMain} hold with $\alpha, \rho = \Omega(1)$, $\sigma = O(1)$, $R = \mathrm{poly}(d)$, and $S^\star$ is a union of at most $k$ intervals (for known $k$). Then, there is an algorithm that given $n = \mathrm{poly}(dk/\varepsilon)$ i.i.d.

Figures (3)

Figure 1: Example of positive-only learning under smoothness. Here, we wish to PAC learn the interval $[0,1]$ under base distribution $\euscr{D}^\star$. We are given sample access only to $\euscr{D}^\star_{+}$, which is the restriction of $\euscr{D}^\star$ to $[0,1]$, and to $\euscr{D}$, which is smooth w.r.t. $\euscr{D}^\star$. In particular, the dotted blue line shows the function $2 \euscr{D}(x)^{1/10}$, which lies above the density $\euscr{D}^\star(x)$, hence $\euscr{D}$ is $(1/2, 10)$-smooth w.r.t. $\euscr{D}^\star$.
Figure 2: Example of \ref{alg:unions} in action, for learning a union of $k = 2$ intervals with error $\varepsilon = 0.2$. The points above represent samples on the real line. The red diamonds are positive samples drawn from the distribution $\euscr{D}^\star_{+}$, while the black dots are samples from the distribution $\euscr{D}$, which is smooth w.r.t. the target $\euscr{D}^\star$. The algorithm considers the intervals defined by consecutive red points and discards the $\frac{k-1}{\varepsilon} = 5$ intervals with the most black points. The final output is the union of the intervals denoted in blue.
Figure 3: Illustration for \ref{scenario:astronomy}. A multi-fiber spectroscope from the APOGEE survey, where the requirement of a minimum distance between fibers creates spatially dependent selection effects. Image credit: sdss_dr17_instruments.
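The procedure described in the caption of Figure 2 can be sketched as follows. This is a minimal reading of the caption alone, not the paper's algorithm as stated: the function name `learn_union`, the choice to count only points strictly inside each gap, the rounding of $\frac{k-1}{\varepsilon}$ up to an integer, and the tie-breaking order are assumptions, and the sample points in the usage example are made up.

```python
import math
from bisect import bisect_left, bisect_right


def learn_union(positives, unlabeled, k, eps):
    """Sketch of the Figure 2 procedure (hypothetical helper).

    positives: samples from the positive distribution (red diamonds)
    unlabeled: samples from the smooth distribution (black dots)
    Returns a list of disjoint intervals (lo, hi).
    """
    pts = sorted(positives)
    gaps = list(zip(pts[:-1], pts[1:]))   # intervals between consecutive positives
    us = sorted(unlabeled)
    # Number of unlabeled points strictly inside each gap (an assumption).
    counts = [max(0, bisect_left(us, hi) - bisect_right(us, lo)) for lo, hi in gaps]
    # Discard the (k - 1) / eps gaps containing the most unlabeled points.
    num_discard = min(len(gaps), math.ceil((k - 1) / eps - 1e-9))
    by_count = sorted(range(len(gaps)), key=lambda i: counts[i], reverse=True)
    discarded = set(by_count[:num_discard])
    # Merge the surviving gaps that share an endpoint.
    merged = []
    for i, (lo, hi) in enumerate(gaps):
        if i in discarded:
            continue
        if merged and merged[-1][1] == lo:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged


# Made-up data: positives clustered in two groups, unlabeled points mostly between them.
positives = [0.1, 0.5, 0.9, 3.1, 3.5, 3.9]
unlabeled = [0.3, 1.5, 2.0, 2.5]
result = learn_union(positives, unlabeled, k=2, eps=0.5)
print(result)
```

In this toy run two gaps are discarded: the wide gap between the two clusters, which holds three unlabeled points, and one gap inside the first cluster, which holds one. The second discard illustrates that the procedure may give up a small amount of the target set, which is the kind of loss the error parameter $\varepsilon$ is meant to control.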

Theorems & Definitions (91)

Theorem 1.1: Informal; see \ref{thm:main}
Definition 1: Truncated Linear Regression Model
Theorem 3.1
Definition 2: Smoothness; lee2025learning
Theorem 3.2: Theorem 1.1 of lee2025learning
Theorem 3.3: Positive PAC Learning unions of intervals under Smoothness
Lemma 3.3: Smoothness
Example A.1: Malmquist Bias
Example A.2: Truncation Biases in Large-Scale Astronomical Surveys
Corollary B.1
...and 81 more
