A Large Dimensional Analysis of Multi-task Semi-Supervised Learning

Victor Leger, Romain Couillet

TL;DR

This paper studies a simple linear classifier that unifies multi-task learning, semi-supervised learning, and uncertain labeling in a high-dimensional regime using random matrix theory. It derives deterministic equivalents for the unlabeled score, proving it is asymptotically Gaussian with computable mean and variance, and identifies an optimal label vector that maximizes class separability while enabling threshold-based decisions. The framework accommodates uncertain labeling and analyzes limited labeled data, providing explicit expressions for minimal asymptotic error and a practical strategy to tune hyperparameters without cross-validation. Experiments on synthetic and real data corroborate the theory, showing robust performance near information-theoretic bounds, effective handling of label uncertainty and class imbalance, and applicability to real-world, concentrated-vector data.

Abstract

This article conducts a large dimensional study of a simple yet versatile classification model that encompasses both multi-task and semi-supervised learning and accounts for uncertain labeling. Using tools from random matrix theory, we characterize the asymptotics of some key functionals, which allows us on the one hand to predict the performance of the algorithm, and on the other hand to reveal some counter-intuitive guidance on how to use it efficiently. The model, powerful enough to provide good performance guarantees, is also straightforward enough to provide strong insights into its behavior.

Paper Structure

This paper contains 22 sections, 7 theorems, 75 equations, 9 figures, 1 algorithm.

Key Result

Theorem 3

Under Assumptions ass:data_distribution_2 and ass:growth_rate, and if labels are given by eq:simple_degrees, then for any unlabeled sample ${\mathbf{x}}\in\mathcal{C}_j^{t}$, its associated score $f$ is asymptotically Gaussian, with $m_j^t = (1-\delta^t){{\mathbf{a}}_j^t}^{\sf T}\tilde{{\mathbf{y}}}$ and $\sigma^t = (1-\delta^t)\sqrt{\tilde{{\mathbf{y}}}^{\sf T}{\mathbf{B}}^t\tilde{{\mathbf{y}}}}$, where $\delta^t$, ${\mathbf{a}}_j^t$ and ${\mathbf{B}}^t$ are defined in the paper.
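To make the Gaussian limit concrete, here is a minimal sketch of how per-class error rates follow from it once a decision threshold is fixed. It assumes that $m_j^t$ acts as the mean and $\sigma^t$ as the standard deviation of the limiting distribution, that $m_1^t < m_2^t$, and that a sample is assigned to class 2 when its score exceeds a threshold $\zeta^t$. The function names and the numerical values are hypothetical and are not taken from the paper.

```python
import math


def gaussian_tail(x: float) -> float:
    """Return Q(x) = P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def class_errors(m1: float, m2: float, sigma: float, zeta: float) -> tuple[float, float]:
    """Per-class error rates for a threshold rule on a Gaussian score.

    Assumption (not from the paper): class 1 is predicted when f < zeta,
    class 2 otherwise, with m1 < m2 and a common standard deviation sigma.
    """
    err_class1 = gaussian_tail((zeta - m1) / sigma)  # class-1 score lands above zeta
    err_class2 = gaussian_tail((m2 - zeta) / sigma)  # class-2 score lands below zeta
    return err_class1, err_class2


# Hypothetical values, chosen only for illustration.
m1, m2, sigma = -1.0, 1.0, 1.0
zeta = 0.5 * (m1 + m2)  # midpoint threshold

err1, err2 = class_errors(m1, m2, sigma, zeta)
print(f"class-1 error: {err1:.4f}, class-2 error: {err2:.4f}")
```

With a common standard deviation, the midpoint threshold gives equal error rates for the two classes; moving `zeta` toward one mean trades error in one class for error in the other.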

Figures (9)

  • Figure 1: Asymptotic probability distribution of the score function $f$ for samples of both classes. The classification errors expressed in Remark rm:error can be interpreted as the area delimited by the density curve of $f$ and the threshold $\zeta^t$.
  • Figure 2: Joint evolution of optimal labeling and classification error as a function of correlation between tasks ($p=200$, $n_\ell^1=100$, $n_\ell^2=1000$, $n_u^1=n_u^2=250$). (Top) Optimal labels with normalization $\|\tilde{{\mathbf{y}}}\|=1$. Optimal labels adapt themselves to avoid negative transfer. (Bottom) Classification error for both naive and optimal algorithms. Our algorithm is close to optimal, while naive labels induce a negative transfer when tasks are not related enough.
  • Figure 3: Joint evolution of optimal labeling and classification error as a function of the number of labeled data in class $\mathcal{C}_1$. The total number of labeled data is constant ($n_\ell=n_{\ell 1}+n_{\ell 2}=1000$, $p=200$, $n_{u1}=n_{u2}=200$). (Top) Optimal labels with normalization $\|\tilde{{\mathbf{y}}}\|=1$, and optimal threshold $\zeta$ (also normalized). Optimal labels adapt themselves to compensate for class imbalance. (Bottom) Theoretical and empirical classification error for both naive and optimal labels and threshold. The overall error is lower with our algorithm, while naive labels and threshold induce a high error for the most represented class.
  • Figure 4: Number of imprecise samples $n_i$ needed to match the performance obtained with $n_r$ reliable samples, for different values of $r$ ($1$-task, $p=200$, $n_{u1}=n_{u2}=200$). For each point, $n_r$ is a random number between $20$ and $400$. The figure strongly suggests that $n_i$ is a linear function of $n_r$.
  • Figure 5: Ratio $\frac{n_r}{n_i}$ for different values of reliability ($r$) and difficulty of the task ($D=1/\|{\boldsymbol{\mu}}_1-{\boldsymbol{\mu}}_2\|$). The higher the ratio, the more effectively imprecise samples contribute to the task. Both panels show that the harder the task is, the less useful imprecise samples are. However, increasing $r$ leads to significantly better results. (Top) Ratio $\frac{n_r}{n_i}$ as a function of the reliability of imprecise data, for different values of difficulty. (Bottom) Ratio $\frac{n_r}{n_i}$ as a function of the difficulty of the task, for different values of reliability.
  • ...and 4 more figures

Theorems & Definitions (9)

  • Theorem 3
  • Definition 4
  • Remark 5
  • Proposition 6
  • Proposition 7
  • Theorem 8
  • Lemma 9
  • Lemma 10
  • Lemma 11