Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition

Mariia Drozdova, Vitaliy Kinakh, Yury Belousov, Erica Lastufka, Slava Voloshynovskiy

TL;DR

This work tackles distribution shift when applying pre-trained vision foundation models to downstream tasks with limited labels. It introduces a semi-supervised fine-tuning framework based on content-style decomposition of the latent [CLS] representation. An information-theoretic objective combines supervised cross-entropy with adversarial KL regularizers to disentangle content from style and to reconstruct the latent token. Empirical evaluation across six datasets (MNIST and two striped variants, SVHN, CIFAR-10, and GalaxyMNIST) shows gains over supervised fine-tuning on most of them. A frozen backbone performs best on simple tasks, while a trainable backbone becomes advantageous on more complex or shifted data, notably with RADIOv2. The approach mitigates distribution mismatch and enhances task-specific representations, enabling better deployment of vision foundation models in scientifically oriented, label-scarce settings.

Abstract

In this paper, we present a semi-supervised fine-tuning approach designed to improve the performance of pre-trained foundation models on downstream tasks with limited labeled data. By leveraging content-style decomposition within an information-theoretic framework, our method enhances the latent representations of pre-trained vision foundation models, aligning them more effectively with specific task objectives and addressing the problem of distribution shift. We evaluate our approach on multiple datasets, including MNIST, its augmented variations (with yellow and white stripes), CIFAR-10, SVHN, and GalaxyMNIST. The experiments show improvements over the supervised fine-tuning baseline of pre-trained models, particularly in low-label regimes, with both frozen and trainable backbones, for the majority of the tested datasets.
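To make the decomposition concrete, the following is a minimal forward-pass sketch in NumPy, not the authors' implementation. It assumes simple linear heads (`W_content`, `W_style`, `W_decode`), toy dimensions, and random stand-ins for the backbone's [CLS] tokens; all of these names and sizes are hypothetical. It computes only the supervised cross-entropy term (on the labeled subset) and a mean-squared reconstruction term (on all samples). The adversarial KL regularizers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: [CLS] width, number of classes, style-code width.
D, K, S = 16, 10, 4

# Hypothetical linear heads; the paper's actual networks are not given here.
W_content = rng.normal(scale=0.1, size=(D, K))      # [CLS] -> class logits
W_style = rng.normal(scale=0.1, size=(D, S))        # [CLS] -> style code
W_decode = rng.normal(scale=0.1, size=(K + S, D))   # (content, style) -> [CLS]


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def decompose(cls_tokens):
    """Split [CLS] tokens into content probabilities and a style code."""
    content = softmax(cls_tokens @ W_content)
    style = cls_tokens @ W_style
    return content, style


def reconstruct(content, style):
    """Rebuild the [CLS] token from concatenated content and style."""
    return np.concatenate([content, style], axis=1) @ W_decode


def losses(cls_tokens, labels, labeled_mask):
    """Return (cross-entropy on labeled samples, MSE reconstruction on all)."""
    content, style = decompose(cls_tokens)
    recon = reconstruct(content, style)
    recon_loss = np.mean((recon - cls_tokens) ** 2)
    idx = np.flatnonzero(labeled_mask)       # indices of labeled samples
    picked = content[idx, labels[idx]]       # probability of the true class
    ce_loss = -np.mean(np.log(picked + 1e-12))
    return ce_loss, recon_loss


# Random stand-ins for backbone outputs; only the first two samples are labeled.
cls_tokens = rng.normal(size=(8, D))
labels = rng.integers(0, K, size=8)          # entries for unlabeled rows are unused
labeled_mask = np.array([True, True] + [False] * 6)

ce, rec = losses(cls_tokens, labels, labeled_mask)
print(f"cross-entropy: {ce:.4f}  reconstruction: {rec:.4f}")
```

In a training loop the two terms would be weighted, summed with the regularizers, and minimized with respect to the heads (and the backbone, when it is trainable); those weights are not specified in this summary.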

Paper Structure (22 sections, 6 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Architecture of the proposed semi-supervised fine-tuning scheme. The vision foundation model generates a representation in the form of a [CLS] token $\tilde{\mathbf{y}}_x$, which is decomposed into content attribute label $\hat{\mathbf{c}}_{a_x}$ and generic style $\hat{\mathbf{s}}_{a_x}$. These are then used for targeted reconstruction of the [CLS] token $\hat{\mathbf{y}}_x$.
  • Figure 2: One row of samples per dataset. From top to bottom: MNIST, MNIST with yellow stripes, MNIST with white stripes, SVHN, CIFAR-10, GalaxyMNIST.
  • Figure 3: RADIOv2 model results. The y-axis shows the best error rate (1 - accuracy), and the x-axis shows the number of labeled samples. For each dataset, the classifier is trained with both supervised learning and our proposed method, using both frozen and trainable backbones.
  • Figure 4: DINOv2 model results. The y-axis shows the best error rate (1 - accuracy), and the x-axis shows the number of labeled samples. For each dataset, the classifier is trained with both supervised learning and our proposed method, using both frozen and trainable backbones.
  • Figure 5: CLIP model results. The y-axis shows the best error rate (1 - accuracy), and the x-axis shows the number of labeled samples. For each dataset, the classifier is trained with both supervised learning and our proposed method, using both frozen and trainable backbones.
  • ...and 12 more figures