Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Mushui Liu, Bozheng Li, Yunlong Yu

TL;DR

The paper tackles the fragility of prompt-tuning CLIP-style VLMs when transferring to new domains under few-shot supervision by proposing CLIP-CITE, a full fine-tuning framework with targeted safeguards. It introduces a discriminative visual-text alignment task, a supervised contrastive loss, and a vision-language similarity distillation term to guide task-specific adaptation without catastrophic forgetting. Across 11 diverse datasets and four task settings (FSL, DG, BNG, CDG), CLIP-CITE shows strong gains in base-class accuracy, open-vocabulary generalization, and cross-domain robustness, outperforming or matching prompt-based baselines. The results indicate that carefully designed full fine-tuning can achieve domain-specific specialization while preserving versatility, offering practical gains in few-shot VLM deployment and transfer learning.

Abstract

Prompt tuning, which trains only a small set of parameters, effectively adapts pre-trained Vision-Language Models (VLMs) to downstream tasks. However, it often comes at the cost of flexibility and adaptability when the tuned models are applied to different datasets or domains. In this paper, we explore capturing task-specific information via meticulous refinement of entire VLMs, with minimal parameter adjustments. When fine-tuning entire VLMs for specific tasks under limited supervision, overfitting and catastrophic forgetting become the de facto obstacles. To mitigate these issues, we propose a framework named CLIP-CITE that designs a discriminative visual-text task, further aligns the visual-text semantics in a supervised manner, and integrates knowledge distillation techniques to preserve the gained knowledge. Extensive experimental results under few-shot learning, base-to-new generalization, domain generalization, and cross-domain generalization settings demonstrate that our method effectively enhances performance on specific tasks under limited supervision while preserving the versatility of the VLMs on other datasets.
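
The abstract names three ingredients: a discriminative visual-text task, supervised visual-text alignment, and knowledge distillation. The following is a minimal pure-Python sketch of how such terms could be combined into one objective. It is an illustration under stated assumptions, not the authors' implementation: the exact form of each term, the function names, the temperature value, and the assignment of the weights `lam` and `eta` to the contrastive and distillation terms are all hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    m = max(row)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]

def similarity_logits(image_feats, text_feats, temperature=0.01):
    """Cosine similarity of every image to every text, scaled by 1/temperature."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def classification_loss(logits, labels):
    """Discriminative task (assumed form): cross-entropy over class-text logits."""
    return -sum(log_softmax(r)[y] for r, y in zip(logits, labels)) / len(labels)

def supervised_contrastive_loss(image_feats, text_feats, labels, temperature=0.01):
    """Assumed form: within a batch, every sample sharing the image's label
    contributes a positive text; all other samples are negatives."""
    batch_texts = [text_feats[y] for y in labels]
    logits = similarity_logits(image_feats, batch_texts, temperature)
    total = 0.0
    for i, row in enumerate(logits):
        logp = log_softmax(row)
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total -= sum(logp[j] for j in positives) / len(positives)
    return total / len(labels)

def distillation_loss(student_logits, teacher_logits):
    """Assumed form: KL(teacher || student) over image-to-text similarity rows."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp, t_logp = log_softmax(s_row), log_softmax(t_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return total / len(student_logits)

def total_loss(s_img, s_txt, t_img, t_txt, labels, lam=1.0, eta=1.0):
    """Weighted sum of the three terms; lam and eta are hypothetical weights."""
    s_logits = similarity_logits(s_img, s_txt)
    t_logits = similarity_logits(t_img, t_txt)  # frozen pre-trained model
    return (classification_loss(s_logits, labels)
            + lam * supervised_contrastive_loss(s_img, s_txt, labels)
            + eta * distillation_loss(s_logits, t_logits))

# Toy usage: two classes, three images, 2-D features.
img = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
txt = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1, 0]
loss = total_loss(img, txt, img, txt, labels)
```

In this sketch the distillation term vanishes when the fine-tuned and frozen models produce identical similarities, so it only penalizes drift away from the pre-trained model, which is the role the abstract assigns to distillation.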

Paper Structure (5 sections, 7 figures, 6 tables)

This paper contains 5 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 2: FSL comparison results of our CLIP-CITE and four competitors on the 11 datasets. All methods are trained with the ViT-B/16 backbone and implemented with the same experimental settings. We report the average performance over the 11 datasets.
  • Figure 3: Ablation on different fine-tuning parts of the model.
  • Figure 4: Comparison results with the different ensemble ratio $\alpha$.
  • Figure 5: Training loss and accuracy of FT-Probe and CLIP-CITE on the EuroSAT dataset.
  • Figure 6: Impact (%) of the hyper-parameters $\lambda$ and $\eta$ on BNG performance. We report the results on the ImageNet dataset.
  • ...and 2 more figures