A Large Dimensional Analysis of Multi-task Semi-Supervised Learning
Victor Leger, Romain Couillet
TL;DR
This paper studies a simple linear classifier that unifies multi-task learning, semi-supervised learning, and uncertain labeling in a high-dimensional regime using random matrix theory. It derives deterministic equivalents for the unlabeled score, proving it is asymptotically Gaussian with computable mean and variance, and identifies an optimal label vector that maximizes class separability while enabling threshold-based decisions. The framework accommodates uncertain labeling and analyzes limited labeled data, providing explicit expressions for minimal asymptotic error and a practical strategy to tune hyperparameters without cross-validation. Experiments on synthetic and real data corroborate the theory, showing robust performance near information-theoretic bounds, effective handling of label uncertainty and class imbalance, and applicability to real-world, concentrated-vector data.
Abstract
This article conducts a large dimensional study of a simple yet versatile classification model that encompasses both multi-task and semi-supervised learning and accounts for uncertain labeling. Using tools from random matrix theory, we characterize the asymptotics of key functionals, which allows us, on the one hand, to predict the performance of the algorithm and, on the other hand, to reveal counter-intuitive guidance on how to use it efficiently. The model is powerful enough to provide good performance guarantees, yet simple enough to yield strong insights into its behavior.
