Semi-supervised linear regression with missing covariates

Benedict M. Risebrow, Thomas B. Berrett

TL;DR

Results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present.

Abstract

Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.

Paper Structure (44 sections, 30 theorems, 339 equations, 10 figures, 3 tables)

Key Result

Theorem 1

Assume the noise distribution satisfies $\mathbb{E}\left[\epsilon^8\right]^{\frac{1}{4}}\leq\kappa_{\epsilon}\sigma^2$, for some $\kappa_{\epsilon}\geq 1$. Further assume the distribution of covariates satisfies Assumptions \ref{assump:small-ball}, \ref{assump:subgauss} and \ref{assump:eigen}. Suppose that for some c … We also assume that the estimate of the covariance $\hat{\Sigma}$ is symmetric, positive-definite a…

Figures (10)

  • Figure 1: Simple monotonic pattern
  • Figure 2: A non-monotonic pattern
  • Figure 3: CC refers to a complete case analysis of the 100 complete cases via least squares. SI refers to the estimator \eqref{eq:OSS estimator definition initial} with weights $\hat{D}_{1}=\hat{D}_{2}=1$. ISS refers to our estimator \eqref{eq:OSS estimator definition initial} with oracle weights $\hat{D}_{1}=1, \hat{D}_{2} = \frac{\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$, where $M=\{10\}$. Error bars are given by the ribbons.
  • Figure 4: CC denotes the OLS estimate on complete cases. ISS is our proposed estimator \eqref{eq:OSS estimator definition initial} with oracle weights $\hat{D}_{1}=1, \hat{D}_{2}=\frac{\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$. For $c>0$, ISS$_c$ denotes the estimator with weights $\hat{D}_{1}=1, \hat{D}_{2}=\frac{c\sigma^2}{\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M}}$. Error bars from 1,000 repetitions are shown by ribbons.
  • Figure 5: We compute our estimator \eqref{eq:OSS estimator definition initial} with unlabelled sample size $N$ varying from $50$ to $5{,}000$. ISS is the ideal semi-supervised estimator \eqref{eq:OSS estimator definition initial}. CC is the complete case estimator. Labelled sample sizes are $n_{1}=100$ and $n_{2}$ varying from $0$ to $100{,}000$. Error bars from 1,000 repetitions are shown by ribbons.
  • ...and 5 more figures
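
The oracle weights quoted in the captions of Figures 3 and 4 have a simple interpretation. If a missing block $X_M$ is replaced by its linear prediction from the observed block, the regression on the imputed covariates carries an extra noise term with variance $(\beta^*_{M})^TS_M\beta^*_{M}$, so the incomplete rows are down-weighted by $\sigma^2/(\sigma^2+(\beta^*_{M})^TS_M\beta^*_{M})$. The Python sketch below illustrates this idea with a plain weighted least-squares fit. It is not the paper's estimator, whose exact definition is not reproduced on this page. The function name `run_once`, the equicorrelated Gaussian design, the choice $\beta^*=(1,\dots,1)$, the reading of $S_M$ as the conditional covariance of $X_M$ given the observed block, and the assumption that unlabelled covariates are fully observed are all illustrative assumptions.

```python
import numpy as np


def run_once(seed, n1=100, n2=2000, N=5000, d=10, sigma=1.0, rho=0.5):
    """One simulated comparison of complete-case OLS and a weighted
    least-squares fit on imputed data (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Hypothetical design: equicorrelated covariance, all-ones coefficients.
    Sigma = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
    beta = np.ones(d)
    M = np.array([d - 1])   # missing block: last coordinate (M = {10}, 1-based)
    O = np.arange(d - 1)    # observed block

    def draw(n):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        return X, X @ beta + sigma * rng.standard_normal(n)

    X1, y1 = draw(n1)       # labelled, complete rows
    X2, y2 = draw(n2)       # labelled rows whose block M is unobserved
    Xu, _ = draw(N)         # unlabelled rows, assumed fully observed
    X2[:, M] = np.nan       # make sure the missing block is never used

    # Covariance estimate from every fully observed covariate row.
    Sigma_hat = np.cov(np.vstack([X1, Xu]), rowvar=False)

    # Impute the missing block by its estimated linear prediction from O.
    coef = np.linalg.solve(Sigma_hat[np.ix_(O, O)], Sigma_hat[np.ix_(O, M)])
    X2_imp = X2.copy()
    X2_imp[:, M] = X2[:, O] @ coef

    # Oracle weight: uses the true Sigma, beta and sigma, as in the captions.
    S_M = Sigma[np.ix_(M, M)] - Sigma[np.ix_(M, O)] @ np.linalg.solve(
        Sigma[np.ix_(O, O)], Sigma[np.ix_(O, M)]
    )
    D2 = sigma**2 / (sigma**2 + float(beta[M] @ S_M @ beta[M]))

    # Complete-case OLS on the n1 fully observed labelled rows.
    beta_cc = np.linalg.lstsq(X1, y1, rcond=None)[0]

    # Weighted least squares: weight 1 on complete rows, D2 on imputed rows.
    A = X1.T @ X1 + D2 * (X2_imp.T @ X2_imp)
    b = X1.T @ y1 + D2 * (X2_imp.T @ y2)
    beta_w = np.linalg.solve(A, b)

    err_cc = float(np.sum((beta_cc - beta) ** 2))
    err_w = float(np.sum((beta_w - beta) ** 2))
    return err_cc, err_w, D2


if __name__ == "__main__":
    results = [run_once(seed) for seed in range(20)]
    print("mean squared error, complete case:", np.mean([r[0] for r in results]))
    print("mean squared error, weighted fit: ", np.mean([r[1] for r in results]))
    print("oracle weight D2:", results[0][2])
```

In practice the oracle weight is unavailable because it depends on $\sigma^2$ and $\beta^*$. The ISS$_c$ variants in Figure 4 rescale this weight by a constant $c>0$.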

Theorems & Definitions (61)

  • Example 1
  • Example 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • ...and 51 more