Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data
Eyar Azar, Boaz Nadler
TL;DR
This work studies semi-supervised learning (SSL) for high-dimensional binary Gaussian classification with a sparse mean difference $\Delta\bm\mu$. It derives nonasymptotic information-theoretic lower bounds for exact support recovery and computational lower bounds via the low-degree framework, delineating the regions of the parameter space where SSL is beneficial and where the problem remains hard. The authors introduce a polynomial-time SSL algorithm, LSPCA, which screens features using the labeled data and then applies PCA (or sparse PCA) to the unlabeled data to recover the sparse support and construct an accurate linear classifier, with guarantees in a specific regime of the parameter space (the regime the paper marks in blue). Simulations complement the theory, showing that SSL can outperform supervised, unsupervised, and self-training SSL approaches in appropriate regimes, and highlighting the provable benefits of combining labeled and unlabeled data for high-dimensional feature selection and classification.
Abstract
The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high-dimensional sparse Gaussian classification. To construct an accurate classifier, a key task is feature selection: detecting the few variables that separate the two classes. For this SSL setting, we analyze information-theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to construct an accurate SSL classifier in polynomial time, whereas any computationally efficient supervised or unsupervised learning scheme that separately uses only the labeled or only the unlabeled data would fail. Our work highlights the provable benefits of combining labeled and unlabeled data for classification and feature selection in high dimensions. We present simulations that complement our theoretical analysis.
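
The two-stage idea behind LSPCA summarized in the TL;DR (screen features with the labeled data, then run PCA on the unlabeled data restricted to the screened features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes balanced classes with means $\pm\bm\mu$, identity covariance, and a known sparsity `k`; the screening size `n_screen`, the helper names, and all numerical values in `demo` are hypothetical choices made only for illustration.

```python
import numpy as np


def lspca_sketch(X_lab, y_lab, X_unl, k, n_screen):
    """Illustrative two-stage support estimate (not the paper's exact procedure)."""
    # Stage 1 (labeled data): rank features by the absolute difference
    # of the two labeled class means and keep the n_screen largest.
    diff = X_lab[y_lab == 1].mean(axis=0) - X_lab[y_lab == -1].mean(axis=0)
    screened = np.argsort(-np.abs(diff))[:n_screen]

    # Stage 2 (unlabeled data): leading eigenvector of the sample covariance
    # restricted to the screened coordinates.
    cov = np.cov(X_unl[:, screened], rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues are returned in ascending order
    v = eigvecs[:, -1]
    top = np.argsort(-np.abs(v))[:k]
    support = np.sort(screened[top])

    # Linear classifier supported on the estimated features. The eigenvector's
    # sign is arbitrary, so the labeled mean difference is used to orient it.
    w = np.zeros(X_lab.shape[1])
    w[screened[top]] = v[top]
    if w @ diff < 0:
        w = -w
    center = X_unl.mean(axis=0)  # close to the class midpoint when classes are balanced
    return support, w, center


def predict(X, w, center):
    """Assign label +1 or -1 by the sign of the centered linear score."""
    return np.where((X - center) @ w >= 0, 1, -1)


def sample(rng, y, mu):
    """Draw x = y * mu + standard Gaussian noise (assumed model)."""
    return y[:, None] * mu[None, :] + rng.standard_normal((y.size, mu.size))


def demo(seed=0):
    """Synthetic example with hypothetical sizes and signal strength."""
    rng = np.random.default_rng(seed)
    p, k, n_lab, n_unl, n_test = 200, 5, 20, 2000, 1000
    mu = np.zeros(p)
    mu[:k] = 1.5  # the true support is the first k coordinates

    y_lab = np.repeat([1, -1], n_lab // 2)
    X_lab = sample(rng, y_lab, mu)
    X_unl = sample(rng, rng.choice([1, -1], size=n_unl), mu)  # labels discarded

    support, w, center = lspca_sketch(X_lab, y_lab, X_unl, k=k, n_screen=40)

    y_test = rng.choice([1, -1], size=n_test)
    X_test = sample(rng, y_test, mu)
    accuracy = float(np.mean(predict(X_test, w, center) == y_test))
    return support, accuracy


if __name__ == "__main__":
    support, accuracy = demo()
    print("estimated support:", support)
    print("test accuracy:", accuracy)
```

The signal in `demo` is deliberately strong, so the sketch only shows the mechanics of the two stages; it does not probe the regime in which the abstract claims that SSL succeeds while purely supervised or purely unsupervised efficient schemes fail.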
