Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance

Giung Nam, Byeongho Heo, Juho Lee

TL;DR

This work addresses the trade-off between fine-tuning and robustness in zero-shot vision-language models by introducing Lipsum-FT, a regularizer that uses random text guidance to minimize the energy gap between the pre-trained zero-shot model and its fine-tuned successor. By framing CLIP-style models as joint energy-based models, the authors show that standard fine-tuning can distort vision-language alignment, and that a smaller energy gap correlates with better robustness to distribution shifts. Lipsum-FT adds a random-text-based regularizer, $\hat{\mathcal{R}}(\boldsymbol{\theta}) = \frac{1}{2M} \lVert \boldsymbol{v}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\boldsymbol{x}) - \boldsymbol{v}_{\boldsymbol{\theta}_0,\boldsymbol{\phi}}(\boldsymbol{x}) \rVert_2^2$, where the $m$-th component of $\boldsymbol{v}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\boldsymbol{x})$ is $\boldsymbol{v}^{(m)}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\boldsymbol{x}) = \langle \mathcal{G}_{\boldsymbol{\phi}}(\boldsymbol{t}_m), \mathcal{F}_{\boldsymbol{\theta}}(\boldsymbol{x})\rangle$ and $\boldsymbol{t}_1, \dots, \boldsymbol{t}_M$ are texts made of random tokens. The regularizer thereby keeps the fine-tuned model's energies consistent with those of the zero-shot model. Empirical results on DomainNet and ImageNet show that Lipsum-FT achieves state-of-the-art robustness under distribution shifts and improves uncertainty quantification, while remaining compatible with post-hoc methods such as WiSE and TPGM. The work advances robust fine-tuning by bringing the language-model component into the regularization objective, and it suggests future work on tailored text guidance strategies.
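
The regularizer above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, and the text features $\mathcal{G}_{\boldsymbol{\phi}}(\boldsymbol{t}_m)$ and image features $\mathcal{F}_{\boldsymbol{\theta}}(\boldsymbol{x})$ are stood in for by small random vectors rather than outputs of real CLIP encoders. Details such as feature normalization or logit scaling are not covered here.

```python
import random


def inner(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def lipsum_logits(text_feats, image_feat):
    """v^(m) = <G_phi(t_m), F_theta(x)> for each of the M random texts."""
    return [inner(g, image_feat) for g in text_feats]


def lipsum_regularizer(text_feats, feat_finetuned, feat_zeroshot):
    """R_hat = 1 / (2M) * || v_theta(x) - v_theta0(x) ||_2^2."""
    v_ft = lipsum_logits(text_feats, feat_finetuned)
    v_zs = lipsum_logits(text_feats, feat_zeroshot)
    m = len(text_feats)
    return sum((a - b) ** 2 for a, b in zip(v_ft, v_zs)) / (2 * m)


# Assumption: random vectors stand in for encoder outputs.
rng = random.Random(0)
M, D = 4, 3  # hypothetical number of random texts and feature dimension
text_feats = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(M)]
feat_zs = [rng.gauss(0, 1) for _ in range(D)]  # zero-shot image feature
feat_ft = [f + 0.1 for f in feat_zs]           # slightly drifted feature

# Identical features give zero penalty; drifted features give a positive one.
print(lipsum_regularizer(text_feats, feat_zs, feat_zs))
print(lipsum_regularizer(text_feats, feat_ft, feat_zs))
```

In this sketch the penalty would be added to the usual fine-tuning loss, with only the image-side features depending on the trainable parameters $\boldsymbol{\theta}$; the text side $\boldsymbol{\phi}$ is shared by both terms, as in the formula.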

Abstract

Large-scale contrastive vision-language pre-trained models provide zero-shot models that achieve competitive performance across a range of image classification tasks without requiring training on downstream data. Recent works have confirmed that while additional fine-tuning of the zero-shot model on the reference data results in enhanced downstream performance, it compromises the model's robustness against distribution shifts. Our investigation begins by examining the conditions required to achieve the goals of robust fine-tuning, employing descriptions based on feature distortion theory and joint energy-based models. Subsequently, we propose a novel robust fine-tuning algorithm, Lipsum-FT, that effectively utilizes the language modeling aspect of the vision-language pre-trained models. Extensive experiments conducted on distribution shift scenarios in DomainNet and ImageNet confirm the superiority of our proposed Lipsum-FT approach over existing robust fine-tuning methods.

Paper Structure

This paper contains 17 sections, 11 equations, 16 figures, and 11 tables.

Figures (16)

  • Figure 1: Overview. We first verify that standard fine-tuning weakens the vision-language connections in the pre-trained CLIP model, as evidenced by changes in the energy function $E_{\boldsymbol{\theta},\boldsymbol{\phi}}$ after $T$ fine-tuning steps, denoted as $\boldsymbol{\theta}_{0} \rightarrow \boldsymbol{\theta}_{T}$ (see the subsection on robust fine-tuning). We then propose a simple yet effective robust fine-tuning method, Lipsum-FT, which regularizes $\operatorname{EnergyGap}(\boldsymbol{\theta}_{T},\boldsymbol{\theta}_{0})$ during fine-tuning; this quantity is easily derived from language model outputs for random texts (see the subsection on Lipsum-FT).
  • Figure 2: Trade-off plots on DomainNet. In each plot, the vertical axis shows accuracy on the reference data and the horizontal axis shows average accuracy under distribution shifts, so the top-right is better. The star markers $\medstar$ correspond to the zero-shot model, while the square markers $\medsquare$ denote the model fine-tuned with 5000 training steps and a learning rate of 1e-05. Left: The number of training steps is kept at 5000, while the learning rate varies over 1e-06, 3e-06, 1e-05, and 3e-05 (the leftmost point denotes 3e-05). Right: The learning rate is constant at 1e-05, while the number of training steps varies over 1000, 3000, 5000, and 10000 (the leftmost point denotes 10000). The results on ImageNet appear in a separate figure.
  • Figure 3: Bar plots depicting the feature distortion on DomainNet. A taller bar represents a larger degree of feature distortion after fine-tuning. The number on each bar denotes the relative difference in distortion values between reference and distribution shift data. Left: The number of training steps is kept at 5000, while the learning rate varies over 1e-06, 3e-06, 1e-05, and 3e-05. Right: The learning rate is constant at 1e-05, while the number of training steps varies over 1000, 3000, 5000, and 10000. These plots use B/16 on DomainNet; results with B/32, B/16, and L/14 on DomainNet and ImageNet appear in separate figures.
  • Figure 4: Energy gaps and distribution shift accuracy on DomainNet. Each plot shows the energy gap (y-axis) against the accuracy on distribution shift data relative to the reference accuracy (x-axis). Square markers represent models obtained through fine-tuning, while diamond markers denote models obtained through post-hoc approaches that combine the fine-tuned and zero-shot models (i.e., WiSE and TPGM). The dashed line depicts a linear trend and is annotated with its Pearson correlation coefficient (PCC). The results on ImageNet appear in a separate figure.
  • Figure 5: Scatter plots for post-hoc methods on DomainNet. Moving upward and rightward signifies improved accuracy on the reference and distribution shift data, respectively, making the top-right corner more desirable. ImageNet results appear in a separate figure, and numerical results for DomainNet and ImageNet appear in the WiSE results tables.
  • ...and 11 more figures
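
The Pearson correlation coefficient (PCC) mentioned in the Figure 4 caption can be computed as in the sketch below. The helper name `pearson` and the four data points are hypothetical toy values chosen to be perfectly linear; they are not measurements from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy values only (not from the paper): relative accuracy on the x-axis,
# energy gap on the y-axis, arranged so the gap shrinks as accuracy grows.
relative_accuracy = [0.70, 0.75, 0.80, 0.85]
energy_gap = [4.0, 3.0, 2.0, 1.0]

# A perfectly linear decreasing relation yields a PCC close to -1.
print(pearson(relative_accuracy, energy_gap))
```

A PCC near -1 on real measurements would indicate that models with smaller energy gaps tend to retain more accuracy under distribution shift, which is the relationship the TL;DR describes.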