Dependable Exploitation of High-Dimensional Unlabeled Data in an Assumption-Lean Framework

Chao Ying, Siyi Deng, Yang Ning, Jiwei Zhao, Heping Zhang

Abstract

Semi-supervised learning has attracted significant attention due to the proliferation of applications featuring limited labeled data but abundant unlabeled data. In this paper, we examine statistical inference in an assumption-lean framework for a high-dimensional regression parameter, defined as the minimizer of the least squares loss, within the context of semi-supervised learning. We investigate when and how unlabeled data can enhance the estimation efficiency of a functional of the regression parameter. First, we demonstrate that a straightforward debiased estimator is more efficient than its supervised counterpart only if the unknown conditional mean function can be consistently estimated at an appropriate rate. Otherwise, incorporating unlabeled data can actually be counterproductive. To address this vulnerability, we propose a novel estimator guaranteed to be at least as efficient as the supervised baseline, even when the conditional mean function is misspecified. This ensures the dependable use of unlabeled data for statistical inference. Finally, we extend our approach to the general M-estimation framework and demonstrate the effectiveness of our methodology through comprehensive simulation studies and a real data application.

Paper Structure

This paper contains 44 sections, 163 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: Simulation results for Model 1 with $p=200$: absolute difference between the empirical 95% coverage probability and the nominal level 0.95. In all panels, rows represent different parameters, columns represent different $N/n$ ratios, and each panel plots the trend over the sample size $n$.
  • Figure 2: Real data application results: point estimates and corresponding confidence intervals for the Oracle, D-Lasso1, D-SSL, and S-SSL methods, with sample size $n=100$ and $N\in\{1000,2000,3000\}$.
  • Figure S.1: Simulation results for Model 1 with $p=500$: absolute difference between the empirical 95% coverage probability and the nominal level 0.95. In all panels, rows represent different parameters, columns represent different $N/n$ ratios, and each panel plots the trend over the sample size $n$.

Theorems & Definitions (8)

  • proof
  • proof: Proof of Lemma \ref{lem_sig}
  • proof: Proof of Proposition \ref{prop_variance1}
  • proof
  • proof
  • proof
  • proof
  • proof