Linear Regression with Unknown Truncation Beyond Gaussian Features

Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, Constantine Caramanis

TL;DR

This work gives the first algorithm for truncated linear regression with an unknown survival set that runs in $\mathrm{poly}(d/\varepsilon)$ time, requiring only that the feature vectors be sub-Gaussian.

Abstract

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$, and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly}(1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with an unknown survival set that runs in $\mathrm{poly}(d/\varepsilon)$ time, requiring only that the feature vectors be sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of work on positive-only PAC learning and may be of independent interest.
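To make the sampling model concrete, the following is a minimal simulation sketch of how truncated samples arise. It is not taken from the paper: the function names, the uniform feature distribution (bounded, hence sub-Gaussian), the Gaussian noise, and the particular survival set are all illustrative assumptions.

```python
import random


def in_survival_set(y, intervals):
    """Return True if y lies in the union of the closed intervals."""
    return any(lo <= y <= hi for lo, hi in intervals)


def sample_truncated(n, w_star, intervals, noise_std=1.0, seed=0):
    """Draw n surviving samples (x, y) by rejection sampling.

    Assumptions (illustrative, not from the paper): features are i.i.d.
    uniform on [-1, 1] and the noise is Gaussian with standard deviation
    noise_std.
    """
    rng = random.Random(seed)
    d = len(w_star)
    kept = []
    while len(kept) < n:
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = sum(wi * xi for wi, xi in zip(w_star, x)) + rng.gauss(0.0, noise_std)
        # The learner only ever sees (x, y) when y falls inside the survival set.
        if in_survival_set(y, intervals):
            kept.append((x, y))
    return kept


# Hypothetical survival set: a union of k = 2 intervals.
survival_set = [(-1.0, 0.0), (1.0, 3.0)]
samples = sample_truncated(200, w_star=[1.0, -2.0, 0.5], intervals=survival_set)
```

In the unknown-truncation setting, the learner receives only `samples`; neither `survival_set` nor the rejected draws are available to it.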

Paper Structure (59 sections, 52 theorems, 228 equations, 3 figures, 5 algorithms)

Key Result

Theorem 1.1

Assume that Assumptions \ref{asmp:massMain}, \ref{asmp:subGaussianMain}, and \ref{asmp:conservationMain} hold with $\alpha, \rho = \Omega(1)$, $\sigma = O(1)$, $R = \mathrm{poly}(d)$, and that $S^\star$ is a union of at most $k$ intervals (for known $k$). Then, there is an algorithm that given $n = \mathrm{poly}(dk/\varepsilon)$ i.i.d.

Figures (3)

  • Figure 1: Example of positive-only learning under smoothness. Here, we wish to PAC learn the interval $[0,1]$ under base distribution $\mathcal{D}^\star$. We are given sample access only to $\mathcal{D}^\star_{+}$, which is the restriction of $\mathcal{D}^\star$ to $[0,1]$, and to $\mathcal{D}$, which is smooth w.r.t. $\mathcal{D}^\star$. In particular, the dotted blue line shows the function $2 \mathcal{D}(x)^{1/10}$, which lies above the density $\mathcal{D}^\star(x)$, hence $\mathcal{D}$ is $(1/2, 10)$-smooth w.r.t. $\mathcal{D}^\star$.
  • Figure 2: Example of \ref{alg:unions} in action, for learning a union of $k = 2$ intervals with error $\varepsilon = 0.2$. The points above represent samples on the real line. The red diamonds are positive samples drawn from the distribution $\mathcal{D}^\star_{+}$, while the black dots are samples from the distribution $\mathcal{D}$, which is smooth w.r.t. the target $\mathcal{D}^\star$. The algorithm considers the intervals defined by consecutive red points, and discards the $\frac{k-1}{\varepsilon} = 5$ intervals with the most black points. The final output is the union of the intervals denoted in blue.
  • Figure 3: Illustrations for \ref{scenario:astronomy}. A multi-fiber spectroscope from the APOGEE survey, where the requirement of a minimum distance between fibers creates spatially dependent selection effects. Image credit: sdss_dr17_instruments.
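The procedure described in the Figure 2 caption can be sketched as follows. This is a reading of the caption only, not the paper's algorithm as stated: the function name `learn_union_of_intervals`, the rounding of $(k-1)/\varepsilon$, the tie-breaking rule, and the toy inputs are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def learn_union_of_intervals(positives, unlabeled, k, eps):
    """Sketch of the caption's procedure (an assumption-laden reading).

    positives: samples from the positive distribution (red diamonds).
    unlabeled: samples from the smooth distribution (black dots).
    Returns a list of disjoint (lo, hi) intervals.
    """
    pos = sorted(positives)
    unl = sorted(unlabeled)
    # Candidate intervals are the gaps between consecutive positive points.
    gaps = list(zip(pos, pos[1:]))

    def count_unlabeled(a, b):
        # Number of unlabeled points strictly between a and b.
        return max(0, bisect_left(unl, b) - bisect_right(unl, a))

    # Discard the (k - 1) / eps gaps containing the most unlabeled points.
    # Rounding and tie-breaking (earlier gap first) are assumptions.
    n_discard = min(len(gaps), round((k - 1) / eps))
    ranked = sorted(range(len(gaps)),
                    key=lambda i: count_unlabeled(*gaps[i]),
                    reverse=True)
    dropped = set(ranked[:n_discard])

    # Merge the surviving adjacent gaps into maximal intervals.
    merged = []
    for i, (a, b) in enumerate(gaps):
        if i in dropped:
            continue
        if merged and merged[-1][1] == a:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged


# Toy input: positives lie in [0, 1] and [2, 3]; most unlabeled points fall
# in the gap (1, 2). eps = 1.0 is chosen only to keep the example tiny.
toy_pos = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
toy_unl = [0.25, 1.2, 1.4, 1.6, 1.8]
result = learn_union_of_intervals(toy_pos, toy_unl, k=2, eps=1.0)
```

On this toy input a single gap is discarded, the one between 1.0 and 2.0, so `result` is `[(0.0, 1.0), (2.0, 3.0)]`. With the caption's values $k = 2$ and $\varepsilon = 0.2$, the same code would discard five gaps.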

Theorems & Definitions (91)

  • Theorem 1.1: Informal; see \ref{thm:main}
  • Definition 1: Truncated Linear Regression Model
  • Theorem 3.1
  • Definition 2: Smoothness; lee2025learning
  • Theorem 3.2: Theorem 1.1 of lee2025learning
  • Theorem 3.3: Positive PAC Learning unions of intervals under Smoothness
  • Lemma 3.3: Smoothness
  • Example A.1: Malmquist Bias
  • Example A.2: Truncation Biases in Large-Scale Astronomical Surveys
  • Corollary B.1
  • ...and 81 more