Asymptotic Bayes risk of semi-supervised learning with uncertain labeling

Victor Leger, Romain Couillet

TL;DR

This work derives the asymptotic Bayes risk for semi-supervised binary classification in a Gaussian mixture model with uncertain labeling. It establishes a high-dimensional fixed-point framework where the Bayes risk converges to Q(sqrt(q_u)) and the overlaps (q_u, q_v) obey coupled equations involving the labeling uncertainty through F_eps and an auxiliary psi_eps function. The analysis connects the impact of labeling uncertainty to a corollary that expresses q_v in terms of the average labeling error eps_bar^2 and a data usefulness function F(q_u), providing intuition on when unlabeled data help. Simulations juxtapose the Bayes-optimal bound with the Leger 2024 algorithm, confirming near-optimal behavior and clarifying how unlabeled data contribute as a function of SNR, data dimensionality, and labeling confidence. Overall, the results offer a principled criterion for when semi-supervised labeling improves performance and a diagnostic bridge between theory and an existing near-optimal algorithm.
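The limiting risk $Q(\sqrt{q_u})$ mentioned above can be evaluated numerically once a value of $q_u$ is available. The sketch below is a minimal illustration, assuming that $Q$ denotes the standard Gaussian tail function $Q(x) = P(Z > x)$ for $Z \sim \mathcal{N}(0,1)$ (the summary does not define it). The function names and the sample values of `q_u` are hypothetical; they are not solutions of the paper's fixed-point equations.

```python
import math


def gaussian_tail(x: float) -> float:
    """Standard Gaussian tail Q(x) = P(Z > x), with Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asymptotic_bayes_risk(q_u: float) -> float:
    """Limiting risk Q(sqrt(q_u)), assuming Q is the Gaussian tail.

    q_u is treated as a given non-negative number; solving the coupled
    equations for (q_u, q_v) is outside the scope of this sketch.
    """
    if q_u < 0:
        raise ValueError("q_u must be non-negative")
    return gaussian_tail(math.sqrt(q_u))


if __name__ == "__main__":
    # Hypothetical overlap values, chosen only to show the trend.
    for q_u in (0.0, 1.0, 4.0):
        print(f"q_u = {q_u:.1f} -> risk = {asymptotic_bayes_risk(q_u):.4f}")
```

Under this reading, a larger overlap $q_u$ gives a lower risk, and $q_u = 0$ gives chance-level error of one half.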

Abstract

This article considers a semi-supervised classification setting on a Gaussian mixture model, where the data are not labeled with certainty, as is usual, but instead carry uncertain labels. Our main aim is to compute the Bayes risk for this model. We compare the behavior of the Bayes risk with that of the best known algorithm for this model. This comparison ultimately gives new insights into the algorithm.

Paper Structure

This paper contains 6 sections, 3 theorems, 22 equations, 5 figures.

Key Result

Theorem 1

Under the previous assumptions, as $p \to \infty$,

Figures (5)

  • Figure 1: Relative error of the approximation $\tilde{F}_\varepsilon(q)\simeq F_\varepsilon(q)$. The error is at most $7\%$ and shrinks when $\varepsilon=0$, when $\varepsilon=1$, or when $q$ is large.
  • Figure 2: Usefulness of unlabeled data as a function of the Bayes risk of the task. Interestingly, the only criterion determining the effectiveness of unlabeled data is how solvable the task is. The lower the Bayes risk, the more useful unlabeled data are for performing the task.
  • Figure 3: Number of labeled data $n_\ell$ needed to reach the same performance, as a function of the confidence in the data labeling, for different values of $\eta$ ($n=1000, p=200, \lambda=0.25$). The empirical values are shown as dots, and the theoretical predictions (built on the results of Section \ref{sec:main}) as solid lines. The less reliable the labeling is, the more labeled data are needed to reach the same performance.
  • Figure 4: Percentage of error reduction obtained by using the semi-supervised algorithm instead of the supervised one, as a function of the SNR $\lambda$ ($n=p=200, \eta=0.2$). The easier the task, the larger the semi-supervised contribution, because the classification error is lower.
  • Figure 5: Percentage of error reduction obtained by using the semi-supervised algorithm instead of the supervised one, as a function of the ratio $c=n/p$ ($\lambda=2, p=200, \eta=0.2$). As $c$ grows, the semi-supervised algorithm becomes increasingly effective compared with the supervised one when measured relative to the oracle error, because the classification error is lower while the oracle error stays constant. However, if the oracle error is no longer used as the reference, the contribution of the semi-supervised algorithm decreases for high values of $c$, because both algorithms edge closer to the oracle bound, which stays far from zero.
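
The captions of Figures 4 and 5 refer to a percentage of error reduction, measured either relative to the oracle error or without that reference. The summary does not give the formula, so the sketch below shows one plausible reading only: the reduction is the gap between supervised and semi-supervised error, divided by the supervised error in excess of a chosen reference (the oracle error, or zero). The function name and all numerical values are hypothetical and are not taken from the paper.

```python
def error_reduction_pct(err_supervised: float,
                        err_semi_supervised: float,
                        reference: float = 0.0) -> float:
    """Percentage of error reduction relative to a reference error.

    Assumed definition (not stated in the source):
        100 * (err_supervised - err_semi_supervised)
            / (err_supervised - reference)
    With reference = oracle error, this measures the share of the excess
    error that is removed; with reference = 0, it is the plain relative
    reduction of the classification error.
    """
    excess = err_supervised - reference
    if excess <= 0:
        raise ValueError("supervised error must exceed the reference")
    return 100.0 * (err_supervised - err_semi_supervised) / excess


if __name__ == "__main__":
    # Made-up error rates for illustration only.
    err_sup, err_ssl, err_oracle = 0.10, 0.09, 0.08
    print("relative to oracle:", error_reduction_pct(err_sup, err_ssl, err_oracle))
    print("relative to zero:  ", error_reduction_pct(err_sup, err_ssl))
```

Under this assumed definition, the same absolute gain looks much larger when the oracle error is the reference than when it is not, which matches the contrast described in the Figure 5 caption.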

Theorems & Definitions (5)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Corollary 1
  • Lemma 1