Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

Sungyeon Kim, Boseung Jeong, Donghyun Kim, Suha Kwak

TL;DR

This work tackles robust yet efficient fine-tuning of zero-shot vision–language models. It introduces Robust Adapter (R-Adapter), which combines lightweight adapters with three self-ensemble strategies to improve out-of-distribution (OOD) robustness while tuning only a small fraction of the parameters, and a Multi-Positive Margin NCE (MPM-NCE) loss to better align multi-positive image–text pairs. The approach extends robust fine-tuning beyond classification to cross-modal retrieval and open-vocabulary segmentation, achieving state-of-the-art results on in-distribution (ID) and multiple OOD datasets with significantly fewer trainable parameters than prior methods. The combination of weight-space re-parameterization, adapter dropping, and temporal accumulation gives a single-model ensemble effect without extra storage, while MPM-NCE provides discriminative, multi-positive alignment. Empirically, R-Adapter delivers consistent gains in robustness and efficiency across ImageNet distribution shifts, few-shot settings, cross-modal retrieval, and open-vocabulary segmentation.

Abstract

Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose MPM-NCE loss designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval and open vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments demonstrate that R-Adapter achieves state-of-the-art performance across a diverse set of tasks, tuning only 13% of the parameters of the CLIP encoders.
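
The storage saving mentioned in the abstract can be illustrated by folding an adapter into the pre-trained layer it follows. The sketch below is a minimal, dependency-free illustration of that idea, not the authors' implementation. It assumes a bias-free linear adapter with an identity skip connection placed after a frozen linear layer; that adapter form, the names `merge_adapter`, `w_pre`, and `w_adapter`, and the toy values are all hypothetical. Under this assumption, re-scaling the adapter by $\alpha$ before merging gives the same weights as linearly interpolating between the pre-trained weight and the fully merged weight, which matches the weight-space ensemble described in the Figure 2 caption.

```python
# Sketch only: the adapter form (linear map plus identity skip, no bias) is an
# assumption made for illustration, not a detail taken from the text above.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

def merge_adapter(w_pre, w_adapter, alpha=1.0):
    """Fold a residual linear adapter, re-scaled by alpha, into the frozen layer.

    Forward pass with the adapter: y = (x @ w_pre) @ (I + alpha * w_adapter)
    Equivalent merged weight:      w = w_pre @ (I + alpha * w_adapter)
    """
    n = len(w_adapter)
    return matmul(w_pre, add(identity(n), scale(w_adapter, alpha)))

w_pre = [[1.0, 2.0], [3.0, 4.0]]         # frozen pre-trained weight (toy values)
w_adapter = [[0.5, 0.0], [0.25, -0.5]]   # trained adapter weight (toy values)
alpha = 0.5                              # re-scaling coefficient

w_full = merge_adapter(w_pre, w_adapter, alpha=1.0)    # merged without re-scaling
w_ens = merge_adapter(w_pre, w_adapter, alpha=alpha)   # merged after re-scaling

# Interpolating between the pre-trained and fully merged weights gives the same result.
w_interp = add(scale(w_pre, 1.0 - alpha), scale(w_full, alpha))
print(w_ens == w_interp)
```

Because the merged weight has the same shape as the pre-trained one, only a single set of weights needs to be kept at inference time under this assumption; setting `alpha=0.0` recovers the pre-trained layer.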

Paper Structure

This paper contains 23 sections, 11 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: We present Robust Adapter (R-Adapter), which combines the strengths of robust fine-tuning and parameter-efficient fine-tuning (PEFT). R-Adapter improves parameter and memory efficiency compared to existing robust fine-tuning methods (e.g., Mask-fill, ModelSoup) while being more robust than existing PEFT methods (e.g., AdaptFormer, MaPLe). Unlike most existing robust fine-tuning methods, ours applies to a wide range of tasks and consistently outperforms the current best methods on diverse tasks in both in-distribution (ID) and out-of-distribution (OOD) settings.
  • Figure 2: An overview of R-Adapter. Each adapter is positioned after the MHA and FFN layers. R-Adapter stochastically drops the adapters during training, and the adapter weights are accumulated using an exponential moving average over the course of training. At evaluation, these weights are re-scaled by $\alpha$ and then re-parametrized to be integrated into their preceding layers, resulting in a weight-space ensemble between the pre-trained layers and the layers re-parametrized without re-scaling.
  • Figure 3: Performance of our method as the re-scaling coefficient $\alpha$ varies, compared against WiSE-FT.
  • Figure 4: Performance of our method as the re-scaling coefficient $\alpha$ in Eq. 9 varies. The score for each cross-modal retrieval benchmark is the sum of recall@K for image retrieval (R@1, R@5, R@10) and recall@K for text retrieval (R@1, R@5, R@10). The score for open-vocabulary segmentation is the mIoU averaged over 5 standard datasets.
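
The retrieval score in the Figure 4 caption is a sum of six recall values, three per retrieval direction. The following sketch shows one way such a score could be computed from an image-text similarity matrix. It is illustrative only: the function names `recall_at_k` and `retrieval_score`, the toy matrix, and the one-to-one pairing between image `i` and text `i` are assumptions, and real benchmarks may pair several captions with each image.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match is among the k most similar candidates.

    sim[i][j] is the similarity between query i and candidate j. The true match
    of query i is assumed to be candidate i (one-to-one pairing, an assumption).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from most to least similar, truncated to k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return 100.0 * hits / len(sim)

def retrieval_score(sim_image_text, ks=(1, 5, 10)):
    """Sum of recall@K over both retrieval directions.

    Rows of sim_image_text index images and columns index texts, so text
    retrieval uses images as queries and image retrieval uses the transpose.
    """
    sim_text_image = [list(col) for col in zip(*sim_image_text)]
    text_retrieval = sum(recall_at_k(sim_image_text, k) for k in ks)
    image_retrieval = sum(recall_at_k(sim_text_image, k) for k in ks)
    return image_retrieval + text_retrieval

# Toy similarity matrix for 3 images (rows) and 3 texts (columns).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
print(retrieval_score(sim))
```

With six recall values each bounded by 100, the summed score ranges from 0 to 600, which is why it can exceed 100 when plotted.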