A Unified Framework for Semiparametrically Efficient Semi-Supervised Learning

Zichun Xu, Daniela Witten, Ali Shojaie

TL;DR

This paper develops a unified semiparametric efficiency framework for semi-supervised learning with labeled data $\mathcal L_n$ and unlabeled covariates $\mathcal U_N$. It derives efficiency lower bounds for the ideal semi-supervised (ISS) and ordinary semi-supervised (OSS) settings, and shows that unlabeled data can improve inference when the target parameter is not well-specified, and cannot when it is well-specified. Two practical estimators, one safe and one efficient, build on an initial supervised estimator: the safe version uses regression to approximate the conditional influence function and is guaranteed to be at least as efficient as the supervised method, while the efficient version uses growing basis expansions to achieve the semiparametric bound. The framework connects to prediction-powered inference (PPI), clarifies when independently trained predictors do and do not help, and yields scalable estimators that can incorporate black-box models. Applications to M-estimation, U-statistics, and average treatment effect estimation demonstrate the approach across standard inferential tasks, with simulations showing when semi-supervised gains materialize and when they do not.

Abstract

We consider statistical inference under a semi-supervised setting where we have access to both a labeled dataset consisting of pairs $\{X_i, Y_i \}_{i=1}^n$ and an unlabeled dataset $\{ X_i \}_{i=n+1}^{n+N}$. We ask the question: under what circumstances, and by how much, can incorporating the unlabeled dataset improve upon inference using the labeled data? To answer this question, we investigate semi-supervised learning through the lens of semiparametric efficiency theory. We characterize the efficiency lower bound under the semi-supervised setting for an arbitrary inferential problem, and show that incorporating unlabeled data can potentially improve efficiency if the parameter is not well-specified. We then propose two types of semi-supervised estimators: a safe estimator that imposes minimal assumptions, is simple to compute, and is guaranteed to be at least as efficient as the initial supervised estimator; and an efficient estimator, which -- under stronger assumptions -- achieves the semiparametric efficiency bound. Our findings unify existing semiparametric efficiency results for particular special cases, and extend these results to a much more general class of problems. Moreover, we show that our estimators can flexibly incorporate predicted outcomes arising from "black-box" machine learning models, and thereby achieve the same goal as prediction-powered inference (PPI), but with superior theoretical guarantees. We also provide a complete understanding of the theoretical basis for the existing set of PPI methods. Finally, we apply the theoretical framework developed to derive and analyze efficient semi-supervised estimators in a number of settings, including M-estimation, U-statistics, and average treatment effect estimation, and demonstrate the performance of the proposed estimators via simulations.
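To make the idea of a regression-based "safe" correction concrete, here is a minimal sketch for the simplest target, the mean $\theta^* = \mathbb{E}[Y]$. It is an illustration in the spirit of the safe estimator described above, not the paper's exact construction: the function name `regression_adjusted_mean`, the argument `g`, and the simulated data are all assumptions made for this example. The supervised estimate is the labeled sample mean; the adjustment regresses the estimated influence function $Y_i - \bar{Y}$ on a chosen basis $g(X)$ and subtracts the fitted coefficient times the gap between the labeled and pooled means of $g(X)$.

```python
import numpy as np


def regression_adjusted_mean(x_lab, y_lab, x_unlab, g=lambda x: x):
    """Illustrative semi-supervised estimate of E[Y] (hypothetical helper).

    Returns (supervised_estimate, adjusted_estimate). `g` maps covariates to
    regression features; g(x) = x is an assumed default for this sketch.
    """
    y_lab = np.asarray(y_lab, dtype=float)
    n = len(y_lab)
    g_lab = np.asarray(g(np.asarray(x_lab, dtype=float)), dtype=float).reshape(n, -1)
    g_unlab = np.asarray(g(np.asarray(x_unlab, dtype=float)), dtype=float)
    g_unlab = g_unlab.reshape(-1, g_lab.shape[1])

    # Supervised estimator: the labeled sample mean.
    theta_sup = y_lab.mean()

    # Least-squares fit of the estimated influence function on centered g(X).
    phi_hat = y_lab - theta_sup
    design = g_lab - g_lab.mean(axis=0)
    beta, *_ = np.linalg.lstsq(design, phi_hat, rcond=None)

    # Gap between the labeled mean and the pooled (labeled + unlabeled) mean of g(X).
    g_all_mean = np.vstack([g_lab, g_unlab]).mean(axis=0)
    gap = g_lab.mean(axis=0) - g_all_mean

    return theta_sup, theta_sup - gap @ beta


if __name__ == "__main__":
    # Assumed toy data-generating process: Y = X + noise, which is NOT a
    # well-specified setting for the mean, so unlabeled data can help.
    rng = np.random.default_rng(0)
    sup, adj = [], []
    for _ in range(500):
        x_l = rng.normal(size=100)
        y_l = x_l + 0.5 * rng.normal(size=100)
        x_u = rng.normal(size=2000)
        s, a = regression_adjusted_mean(x_l, y_l, x_u)
        sup.append(s)
        adj.append(a)
    print("empirical SD, supervised:", np.std(sup))
    print("empirical SD, adjusted:  ", np.std(adj))
```

Passing a fitted predictor as `g` (for example `g=lambda x: f(x)` for some pre-trained `f`, also hypothetical here) gives a rough analogue of using black-box predictions as the regression feature. Because the coefficient is estimated by least squares, a feature unrelated to $Y$ receives a coefficient near zero and the adjusted estimate stays close to the supervised one.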

Paper Structure

This paper contains 43 sections, 35 theorems, 298 equations, 5 figures, 5 tables.

Key Result

Lemma 2.1

Suppose there exists a regular and asymptotically linear estimator of $\theta^*$. Let $\varphi^*_{\eta^*}(z)$ denote the efficient influence function of $\hat{\theta}_n$ at $\mathbb{P}^*$ relative to $\mathcal{P}$. Then, it follows that for any regular and asymptotically linear estimator

Figures (5)

  • Figure 1: Standard errors of estimators of the mean, as described in Section \ref{subsec:simu_mean}, averaged over 1,000 simulations with $n=1,000$. Left: When the conditional influence function is linear (Setting 1), $\hat{\theta}_{n,N}^{\text{safe}}$ with $g(x) = x$ achieves the efficiency lower bound in the OSS setting. Center: When the conditional influence function is non-linear (Setting 2), $\hat{\theta}_{n,N}^{\text{safe}}$ with $g(x) = x$ is no longer efficient, whereas $\hat{\theta}_{n,N}^{\text{eff.}}$ is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition \ref{def:well_specification} (Setting 3), no semi-supervised method can improve upon the supervised estimator, as shown in Corollary \ref{cor:well_specified_OSS}.
  • Figure 2: Standard errors of estimators of the first parameter of the Poisson GLM, as described in Section \ref{subsec:simu_glm}, averaged over 1,000 simulations, with $n=1,000$. Left: When the model is non-linear (Setting 1), and with a sufficient number of basis functions, $\hat{\theta}_{n,N}^{\text{eff.}}$ nearly achieves the OSS semiparametric efficiency lower bound. Moreover, $\hat{\theta}_{n,N}^{\text{PPI}}$ with $g_{\tilde{\eta}_n}(x) = \varphi_{\tilde{\eta}_n}(x,f(x))$ outperforms the PPI++ estimators with the noisy prediction model. Right: For a well-specified estimation problem in the sense of Definition \ref{def:well_specification} (Setting 2), no semi-supervised method can improve upon the supervised estimator. This agrees with Corollary \ref{cor:well_specified_OSS}.
  • Figure 3: Estimated standard error of estimators of $\theta_2^*$ in the Poisson GLM setting, detailed in Section \ref{subsec:simu_glm}. The results are similar to Figure \ref{fig:glm}.
  • Figure 4: Comparison of standard error for methods of variance estimation, averaged over 1,000 simulated datasets. Left: When the conditional influence function is non-linear, $\hat{\theta}_{n,N}^{\text{eff.}}$ is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition \ref{def:well_specification}, no semi-supervised method improves upon the supervised estimator, which aligns with Corollary \ref{cor:well_specified_OSS}.
  • Figure 5: Comparison of standard error for methods that estimate Kendall's $\tau$, averaged over 1,000 simulations. Left: When the conditional influence function is non-linear, $\hat{\theta}_{n,N}^{\text{eff.}}$ is efficient with a sufficient number of basis functions. Right: For a well-specified estimation problem in the sense of Definition \ref{def:well_specification}, no semi-supervised method improves upon the supervised estimator, as shown in Corollary \ref{cor:well_specified_OSS}.

Theorems & Definitions (83)

  • Lemma 2.1
  • Theorem 3.1
  • Definition 3.1: Well-specified parameter
  • Theorem 3.2
  • Corollary 3.3
  • Proposition 3.4
  • Theorem 3.5
  • Remark 3.1: Choice of regression basis function $g(x)$
  • Theorem 3.6
  • Remark 3.2: Role of the marginal distribution $\mathbb{P}^*_{X}$
  • ...and 73 more