Semi-supervised linear regression with missing covariates

Benedict M. Risebrow; Thomas B. Berrett

TL;DR

Results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present.

Abstract

Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.

Paper Structure (44 sections, 30 theorems, 339 equations, 10 figures, 3 tables)

Introduction
Motivation
Formal setting
Existing work
Contributions and outline
Notation
Low-dimensional results
Structured missingness
Upper bound
Lower bound
Unstructured missingness
Lower bound
High-dimensional results
Unstructured missingness
Structured missingness
...and 29 more sections

Key Result

Theorem 1

Assume the noise distribution satisfies $\mathbb{E}\left[\epsilon^8\right]^{\frac{1}{4}}\leq\kappa_{\epsilon}\sigma^2$ for some $\kappa_{\epsilon}\geq 1$. Further assume the distribution of covariates satisfies Assumptions \ref{assump:small-ball}, \ref{assump:subgauss} and \ref{assump:eigen}. Suppose that for some c … We also assume that the estimate of the covariance $\hat{\Sigma}$ is symmetric, positive-definite a…

Figures (10)

Figure 1: Simple monotonic pattern
Figure 2: A non-monotonic pattern
Figure 3: CC refers to a complete case analysis of the 100 complete cases via least squares. SI refers to the estimator \eqref{eq:OSS estimator definition initial} with weights $\hat{D}_{1}=\hat{D}_{2}=1$. ISS refers to our estimator \eqref{eq:OSS estimator definition initial} with oracle weights $\hat{D}_{1}=1$, $\hat{D}_{2} = \frac{\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$, where $M=\{10\}$. Error bars are shown by ribbons.
Figure 4: CC denotes the OLS estimate on complete cases. ISS is our proposed estimator \eqref{eq:OSS estimator definition initial} with oracle weights $\hat{D}_{1}=1$, $\hat{D}_{2}=\frac{\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$. For $c>0$, ISS$_c$ denotes the estimator with weights $\hat{D}_{1}=1$, $\hat{D}_{2}=\frac{c\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$. Error bars from 1,000 repetitions are shown by ribbons.
Figure 5: We compute our estimator \eqref{eq:OSS estimator definition initial} with unlabelled sample size $N$ varying from $50$ to $5{,}000$. ISS is the ideal semi-supervised estimator \eqref{eq:OSS estimator definition initial}. CC is the complete case estimator. Labelled sample sizes are $n_{1}=100$ and $n_{2}$ varying from $0$ to $100{,}000$. Error bars from 1,000 repetitions are shown by ribbons.
...and 5 more figures
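
The oracle weight $\hat{D}_{2}=\frac{\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$ in the captions above can be illustrated with a small sketch. This is not the paper's estimator, whose definition is not reproduced on this page. It is a hedged illustration under stated assumptions: Gaussian covariates with known covariance `Sigma`, $S_M$ read as the conditional covariance of the missing coordinates given the observed ones, and missing entries imputed by their conditional mean. Under that reading, a row with imputed covariates has effective noise variance $\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}$, so $\hat{D}_{2}$ acts as an inverse-variance weight relative to a complete row. The function names `oracle_weight` and `weighted_imputed_ls` are hypothetical.

```python
import numpy as np

def oracle_weight(sigma2, beta_M, S_M):
    """Compute sigma^2 / (sigma^2 + beta_M^T S_M beta_M), as in the captions."""
    beta_M = np.atleast_1d(beta_M).astype(float)
    S_M = np.atleast_2d(S_M).astype(float)
    return sigma2 / (sigma2 + beta_M @ S_M @ beta_M)

def weighted_imputed_ls(X1, y1, X2_imputed, y2, d1, d2):
    """Weighted least squares over a complete block (weight d1)
    and a block with imputed covariates (weight d2)."""
    X = np.vstack([X1, X2_imputed])
    y = np.concatenate([y1, y2])
    w = np.concatenate([np.full(len(y1), d1), np.full(len(y2), d2)])
    Xw = X * w[:, None]                      # rows scaled by their weights
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Toy data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3, 0.5],
                  [0.3, 1.0, 0.4],
                  [0.5, 0.4, 1.0]])
beta = np.array([1.0, -1.0, 2.0])
sigma2 = 1.0
O, M = [0, 1], [2]                           # observed / missing coordinates
n1, n2 = 100, 2000                           # complete rows / rows missing X_M

X1 = rng.multivariate_normal(np.zeros(3), Sigma, size=n1)
X2 = rng.multivariate_normal(np.zeros(3), Sigma, size=n2)
y1 = X1 @ beta + rng.normal(scale=np.sqrt(sigma2), size=n1)
y2 = X2 @ beta + rng.normal(scale=np.sqrt(sigma2), size=n2)

# Conditional-mean imputation of X_M given X_O (assumes Sigma is known).
A = Sigma[np.ix_(M, O)] @ np.linalg.inv(Sigma[np.ix_(O, O)])
S_M = Sigma[np.ix_(M, M)] - A @ Sigma[np.ix_(O, M)]
X2_imp = X2.copy()
X2_imp[:, M] = X2[:, O] @ A.T                # true X_M is never used below

beta_cc = np.linalg.lstsq(X1, y1, rcond=None)[0]   # complete-case OLS
d2 = oracle_weight(sigma2, beta[M], S_M)
beta_w = weighted_imputed_ls(X1, y1, X2_imp, y2, 1.0, d2)

print("complete-case error:", np.linalg.norm(beta_cc - beta))
print("weighted-imputed error:", np.linalg.norm(beta_w - beta))
```

In a semi-supervised setting one would presumably estimate `Sigma` from the unlabelled covariates rather than assume it known, and the oracle weight depends on the unknown $\beta^*_{M}$ and $\sigma^2$; the sketch uses the true values only to make the role of the weight visible.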

Theorems & Definitions (61)

Example 1
Example 2
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Theorem 7
Theorem 8
...and 51 more
