Asymptotic Bayes risk of semi-supervised learning with uncertain labeling
Victor Leger, Romain Couillet
TL;DR
This work derives the asymptotic Bayes risk for semi-supervised binary classification in a Gaussian mixture model with uncertain labeling. It establishes a high-dimensional fixed-point framework in which the Bayes risk converges to Q(sqrt(q_u)) and the overlaps (q_u, q_v) obey coupled equations that involve the labeling uncertainty through F_eps and an auxiliary function psi_eps. A corollary expresses q_v in terms of the average labeling error eps_bar^2 and a data usefulness function F(q_u), which makes the impact of labeling uncertainty explicit and gives intuition on when unlabeled data help. Simulations compare the Bayes-optimal bound with the Leger 2024 algorithm, confirming that the algorithm is near-optimal and clarifying how the contribution of unlabeled data depends on SNR, data dimensionality, and labeling confidence. Overall, the results offer a principled criterion for when semi-supervised learning with uncertain labels improves performance, and a diagnostic bridge between theory and an existing near-optimal algorithm.
Abstract
This article considers a semi-supervised classification setting on a Gaussian mixture model in which the data are not given hard labels, as is usual, but uncertain labels. Our main aim is to compute the Bayes risk for this model. We then compare the behavior of the Bayes risk with that of the best known algorithm for this model; this comparison yields new insights into the algorithm.
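
As a rough illustration of the quantity Q(sqrt(q_u)) mentioned in the TL;DR, the sketch below evaluates it for a given overlap value and checks it against a simple simulation. It rests on assumptions not stated in this section: Q is taken to be the standard Gaussian tail function, the function names are hypothetical, and the simulation covers only the idealized case of a balanced two-class mixture with known means +mu and -mu and identity covariance, where the overlap equals ||mu||^2. It does not solve the coupled fixed-point equations for (q_u, q_v), which are not given here.

```python
import math
import random


def gaussian_tail(x: float) -> float:
    """Q(x) = P(Z > x) for a standard normal Z (assumed meaning of Q)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bayes_risk_from_overlap(q_u: float) -> float:
    """Evaluate Q(sqrt(q_u)) for a given overlap q_u >= 0."""
    return gaussian_tail(math.sqrt(q_u))


def simulate_known_mean_error(mu_norm: float, n_samples: int = 200_000,
                              seed: int = 0) -> float:
    """Monte Carlo error of the sign classifier when the mean is known.

    Projecting x = y * mu + z onto mu / ||mu|| gives a 1-D problem
    y * ||mu|| + noise, so only the norm of mu matters here.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        y = 1 if rng.random() < 0.5 else -1        # balanced classes
        proj = y * mu_norm + rng.gauss(0.0, 1.0)   # projected sample
        predicted = 1 if proj >= 0 else -1
        errors += predicted != y
    return errors / n_samples


if __name__ == "__main__":
    mu_norm = 1.0                       # hypothetical signal strength
    q = mu_norm ** 2                    # overlap in the known-mean case
    print("Q(sqrt(q)):     ", bayes_risk_from_overlap(q))
    print("simulated error:", simulate_known_mean_error(mu_norm))
```

In this idealized case the two printed numbers should agree up to Monte Carlo noise. In the semi-supervised setting of the paper, q_u would instead come from the fixed-point equations and would typically be smaller than ||mu||^2, since the means have to be estimated from data.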
