TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces

Guillaume Quétant, Pavlo Molchanov, Slava Voloshynovskiy

TL;DR

TwinTURBO tackles the challenge of fine-tuning foundation models with extremely limited labels by exploiting mutual information decomposition. It derives two lower bounds: one on the mutual information $I(X;Y)$ for the downstream task space and another on $I(X;Z^*)$ for the latent representations. Both are implemented via density parameterisations, with a discriminator handling the KL term, all within a lightweight adapter-based setup. The method yields practical losses (Categorical, Binary, and InfoNCE variants) and a discriminator-based critic to leverage unlabelled data, plus latent-space alignment losses to stabilise representations. Empirical results on MNIST, CIFAR-10, and SVHN under low-label regimes show substantial accuracy gains and reduced variance, underscoring the value of information-theoretic objectives for semi-supervised fine-tuning and hinting at extensions to multimodal settings.

Abstract

We present a semi-supervised fine-tuning framework for foundation models that utilises mutual information decomposition to address the challenges of training with a limited amount of labelled data. Our approach derives two distinct lower bounds: i) for the downstream task space, such as classification, optimised using conditional and marginal cross-entropy alongside Kullback-Leibler divergence, and ii) for the latent space representation, regularised and aligned using a contrastive-like decomposition. This fine-tuning strategy retains the pre-trained structure of the foundation model, modifying only a specialised projector module comprising a small transformer and a token aggregation technique. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions by effectively leveraging unlabelled data.
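
As a rough illustration of how a bound combining conditional cross-entropy, marginal cross-entropy and a Kullback-Leibler term can arise, the following sketch uses only standard identities. The symbols $q(y|x)$ (a parameterised conditional) and $\tilde{q}(y)$ (an auxiliary marginal) are assumptions introduced here for illustration, and the exact bound derived in the paper may differ.

$$
\begin{aligned}
I(X;Y) &= H(Y) - H(Y|X), \\
H(Y|X) &= \mathbb{E}_{p(x,y)}\left[-\log p(y|x)\right] \le \mathbb{E}_{p(x,y)}\left[-\log q(y|x)\right], \\
H(Y) &= \mathbb{E}_{p(y)}\left[-\log \tilde{q}(y)\right] - D_\text{KL}\left(p(y) \,\|\, \tilde{q}(y)\right), \\
I(X;Y) &\ge -\mathbb{E}_{p(x,y)}\left[-\log q(y|x)\right] + \mathbb{E}_{p(y)}\left[-\log \tilde{q}(y)\right] - D_\text{KL}\left(p(y) \,\|\, \tilde{q}(y)\right).
\end{aligned}
$$

The inequality in the second line holds because the gap between the two sides is $\mathbb{E}_{p(x)}\left[D_\text{KL}\left(p(y|x) \,\|\, q(y|x)\right)\right] \ge 0$. In this reading, the first term of the final bound is a conditional cross-entropy that needs labelled pairs, while the remaining two terms involve only the marginal over $Y$.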

Paper Structure

This paper contains 33 sections, 25 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Model architecture and training setup.
  • Figure 2: Ablation study of the classification accuracy for the SVHN (left) and MNIST (right) datasets. The baseline training uses categorical cross-entropy loss $\mathcal{L}^\text{cat-cross}$ only with the softmax rescaling. For SVHN, the number of unlabelled samples is 73'257 and the weights are set to $\lambda_C = 0.001$, $\lambda_L = 0.1$ and $\lambda_A = 0.1$. For MNIST, the number of unlabelled samples is 60'000 and the weights are set to $\lambda_C = 1.0$, $\lambda_L = 0.1$ and $\lambda_A = 0.1$.
  • Figure 3: Caption
  • Figure 4: All results for the SVHN dataset.
  • Figure 5: All results for the CIFAR10 dataset.
  • ...and 1 more figures