Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition
Mariia Drozdova, Vitaliy Kinakh, Yury Belousov, Erica Lastufka, Slava Voloshynovskiy
TL;DR
This work tackles the distribution shift that arises when pre-trained vision foundation models are applied to downstream tasks with few labels. It introduces a semi-supervised fine-tuning framework that decomposes the latent [CLS] representation into content and style. An information-theoretic objective combines a supervised cross-entropy term with adversarial KL regularizers to disentangle content from style, and a reconstruction term recovers the latent token from the two parts. Evaluation across six datasets (including MNIST variants, SVHN, CIFAR-10, and GalaxyMNIST) shows consistent gains over supervised fine-tuning. A frozen backbone performs best on simple tasks, while fine-tuning the backbone becomes advantageous on more complex or shifted data, notably with RADIOv2. The approach mitigates distribution mismatch and sharpens task-specific representations, making vision foundation models easier to deploy in scientifically oriented, label-scarce settings.
Abstract
In this paper, we present a semi-supervised fine-tuning approach designed to improve the performance of pre-trained foundation models on downstream tasks with limited labeled data. By leveraging content-style decomposition within an information-theoretic framework, our method enhances the latent representations of pre-trained vision foundation models, aligning them more effectively with specific task objectives and addressing the problem of distribution shift. We evaluate our approach on multiple datasets: MNIST, its augmented variations (with yellow and white stripes), CIFAR-10, SVHN, and GalaxyMNIST. The experiments show improvements over a supervised fine-tuning baseline of the pre-trained models, particularly in low-label regimes, with both frozen and trainable backbones, on the majority of the tested datasets.
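To make the shape of such an objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the content code is a class-probability vector, the style code is a real-valued vector, the class prior is uniform, and the style prior is a standard normal. It also swaps the adversarial KL estimators for simple closed-form KL surrogates computed over a batch. All function names, weights, and toy values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, labels):
    """Mean negative log-likelihood, computed on labeled samples only."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def kl_to_uniform(probs):
    """KL(batch-averaged class distribution || uniform prior).
    Closed-form stand-in for an adversarial content regularizer (assumption)."""
    k = len(probs[0])
    mean = [sum(p[j] for p in probs) / len(probs) for j in range(k)]
    return sum(q * math.log(q * k) for q in mean if q > 0)

def kl_to_std_normal(styles):
    """KL(diagonal Gaussian fitted to the style codes || N(0, I)).
    Closed-form stand-in for an adversarial style regularizer (assumption)."""
    n, d = len(styles), len(styles[0])
    kl = 0.0
    for j in range(d):
        col = [s[j] for s in styles]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n + 1e-8
        kl += 0.5 * (var + mu * mu - 1.0 - math.log(var))
    return kl

def mse(a, b):
    """Mean squared error between original and reconstructed latent tokens."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count

def total_loss(probs_labeled, labels, probs_all, styles, latents, recons,
               w_content=1.0, w_style=1.0, w_rec=1.0):
    """Combine the four terms; the weights are hypothetical hyperparameters."""
    terms = {
        "ce": cross_entropy(probs_labeled, labels),
        "kl_content": kl_to_uniform(probs_all),
        "kl_style": kl_to_std_normal(styles),
        "rec": mse(latents, recons),
    }
    terms["total"] = (terms["ce"] + w_content * terms["kl_content"]
                      + w_style * terms["kl_style"] + w_rec * terms["rec"])
    return terms

# Toy batch: three samples, of which only the first two carry labels.
logits = [[2.0, 0.1], [0.2, 1.5], [0.3, 0.4]]
labels = [0, 1]
probs = [softmax(row) for row in logits]
styles = [[0.5, -1.0], [-0.3, 0.8], [0.1, 0.2]]
latents = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 0.3, 0.7]]
recons = [[0.9, 0.1, 1.8], [0.6, 1.4, -0.9], [0.1, 0.2, 0.6]]

terms = total_loss(probs[:2], labels, probs, styles, latents, recons)
print(terms)
```

The cross-entropy term uses only the labeled samples, while the two KL terms and the reconstruction term use the whole batch. This is how unlabeled data can contribute to training in a semi-supervised setup of this kind.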
