Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data

Eyar Azar, Boaz Nadler

TL;DR

This work studies semi-supervised learning (SSL) for high-dimensional binary Gaussian classification with a sparse mean difference $\Delta\bm\mu$. It derives non-asymptotic information-theoretic lower bounds for exact support recovery and computational lower bounds via the low-degree framework, delineating the parameter regions where SSL helps and where the problem remains hard. The authors introduce a polynomial-time SSL algorithm, LSPCA, which first screens features using the labeled data and then applies PCA (or sparse PCA) to the unlabeled data to recover the sparse support and construct an accurate linear classifier, with guarantees in the blue region of the parameter space shown in Figure 1. Simulations complement the theory, showing that in appropriate regimes SSL can outperform supervised approaches, unsupervised approaches, and self-training SSL. Together, the results highlight provable benefits of combining labeled and unlabeled data for high-dimensional feature selection and classification.
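The two-stage idea described above (screen with labeled data, then run PCA on unlabeled data) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names (`screen_features`, `leading_eigenvector`, `lspca_support`, `simulate`), the screening size `m`, the toy model with identity noise covariance and class means $\pm\bm\mu$, and the use of power iteration in place of a full eigendecomposition are all assumptions made for the example.

```python
import random

def screen_features(X_lab, y_lab, m):
    """Stage 1 (labeled data): keep the m features with the largest
    absolute difference between the two class means.
    Assumes both classes (+1 and -1) appear in y_lab."""
    p = len(X_lab[0])
    pos = [x for x, y in zip(X_lab, y_lab) if y == 1]
    neg = [x for x, y in zip(X_lab, y_lab) if y == -1]
    score = [abs(sum(x[j] for x in pos) / len(pos)
                 - sum(x[j] for x in neg) / len(neg)) for j in range(p)]
    return sorted(range(p), key=lambda j: score[j], reverse=True)[:m]

def leading_eigenvector(C, iters=200, seed=0):
    """Power iteration for the top eigenvector of a symmetric PSD matrix."""
    rng = random.Random(seed)
    d = len(C)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return v

def lspca_support(X_lab, y_lab, X_unl, k, m):
    """Stage 2 (unlabeled data): PCA restricted to the screened features,
    then keep the k coordinates with the largest eigenvector magnitude."""
    cand = screen_features(X_lab, y_lab, m)
    d, n = len(cand), len(X_unl)
    mean = [sum(x[j] for x in X_unl) / n for j in cand]
    Z = [[x[j] - mu for j, mu in zip(cand, mean)] for x in X_unl]
    # Sample covariance of the unlabeled data on the candidate features.
    C = [[sum(z[a] * z[b] for z in Z) / n for b in range(d)] for a in range(d)]
    v = leading_eigenvector(C)
    top = sorted(range(d), key=lambda a: abs(v[a]), reverse=True)[:k]
    return sorted(cand[a] for a in top)

def simulate(p, k, n_lab, n_unl, signal, seed=0):
    """Hypothetical toy data: x = y * mu + N(0, I), where mu equals
    `signal` on the first k coordinates and 0 elsewhere."""
    rng = random.Random(seed)
    def draw(y):
        return [y * (signal if j < k else 0.0) + rng.gauss(0, 1)
                for j in range(p)]
    y_lab = [1 if i % 2 == 0 else -1 for i in range(n_lab)]  # balanced labels
    X_lab = [draw(y) for y in y_lab]
    X_unl = [draw(rng.choice((-1, 1))) for _ in range(n_unl)]
    return X_lab, y_lab, X_unl
```

For example, `lspca_support(*simulate(p=100, k=5, n_lab=40, n_unl=1000, signal=1.5), k=5, m=20)` returns the estimated support as a sorted list of feature indices. The screening step shrinks the dimension before PCA, which is what lets a small labeled set make the unlabeled data usable in this sketch.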

Abstract

The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high-dimensional sparse Gaussian classification. To construct an accurate classifier, a key task is feature selection: detecting the few variables that separate the two classes. For this SSL setting, we analyze information-theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to construct an accurate SSL classifier in polynomial time, whereas any computationally efficient supervised or unsupervised learning scheme that separately uses only the labeled or only the unlabeled data would fail. Our work highlights the provable benefits of combining labeled and unlabeled data for classification and feature selection in high dimensions. We present simulations that complement our theoretical analysis.

Paper Structure (26 sections, 20 theorems, 186 equations, 3 figures, 1 algorithm)

Key Result

Theorem 2.1

Fix $\delta\in(0,1)$. For any $(L,p,k)$ satisfying the condition of the theorem (not reproduced in this summary) and for any support estimator $\hat{S}$ based on $\mathcal{D}_L$, it follows that $\max_{S \in \mathbb{S}} \mathbb{P}\left(\hat{S} \neq S\right) > \delta - \frac{\log 2}{\log(p-k+1)}.$
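To get a feel for the size of the right-hand side of Theorem 2.1, the expression $\delta - \frac{\log 2}{\log(p-k+1)}$ can be evaluated numerically. The sketch below only computes this expression; it does not check the theorem's condition on $(L,p,k)$, which is not stated on this page, and the function name `error_lower_bound` and the example values are hypothetical.

```python
import math

def error_lower_bound(delta, p, k):
    """Evaluate delta - log(2) / log(p - k + 1), the right-hand side of
    the bound on the worst-case support-recovery error probability.
    A non-positive value means the bound is vacuous."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if p - k + 1 <= 1:
        raise ValueError("need p - k + 1 > 1 so that the logarithm is positive")
    return delta - math.log(2) / math.log(p - k + 1)

# Illustrative values (not taken from the paper).
bound = error_lower_bound(delta=0.5, p=10_000, k=10)
print(round(bound, 3))
```

Because the correction term $\frac{\log 2}{\log(p-k+1)}$ shrinks only logarithmically in $p-k+1$, the expression approaches $\delta$ slowly as the dimension grows.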

Figures (3)

  • Figure 1: Semi-supervised classification and support recovery regions. The red and green regions follow from previous works. Contributions of our work include identification of the orange and the blue regions.
  • Figure 2: Empirical simulation results. (Left) Support recovery, (Right) Classification error.
  • Figure 3: Empirical simulation results. (Left) Support recovery, (Right) Classification error.

Theorems & Definitions (32)

  • Theorem 2.1
  • Theorem 2.2
  • Theorem 2.3
  • Corollary 2.4
  • Conjecture 2.5: Informal
  • Theorem 2.6
  • Conjecture 2.7
  • Remark 3.1
  • Theorem 3.2
  • Lemma A.1
  • ...and 22 more