Lipsum-FT: Robust Fine-Tuning of Zero-Shot Models Using Random Text Guidance
Giung Nam, Byeongho Heo, Juho Lee
TL;DR
This work addresses the trade-off between downstream accuracy and robustness when fine-tuning zero-shot vision-language models. It introduces Lipsum-FT, a regularization that uses random text guidance to minimize the energy gap between the pre-trained zero-shot model and its fine-tuned successor. By framing CLIP-style models as joint energy-based models, the authors show that standard fine-tuning can distort vision-language alignment, and that a smaller energy gap correlates with improved robustness to distribution shifts. Lipsum-FT adds a random-text-based regularizer, $\hat{\mathcal{R}}(\boldsymbol{\theta}) = \frac{1}{2M} \lVert \boldsymbol{v}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\boldsymbol{x}) - \boldsymbol{v}_{\boldsymbol{\theta}_0,\boldsymbol{\phi}}(\boldsymbol{x}) \rVert_2^2$, where the $m$-th entry of $\boldsymbol{v}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\boldsymbol{x})$ is $\boldsymbol{v}^{(m)} = \langle \mathcal{G}_{\boldsymbol{\phi}}(\boldsymbol{t}_m), \mathcal{F}_{\boldsymbol{\theta}}(\boldsymbol{x})\rangle$ and $\boldsymbol{t}_1, \dots, \boldsymbol{t}_M$ are $M$ random texts, so that the fine-tuned model stays energy-consistent with the zero-shot model. Empirical results on DomainNet and ImageNet show that Lipsum-FT achieves state-of-the-art robustness under distribution shifts and improves uncertainty quantification, while remaining compatible with post-hoc methods such as WiSE and TPGM. The work advances robust fine-tuning by bringing the language-model component into the regularization objective, and it suggests future work on tailored text guidance strategies.
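As an illustration only, the regularizer above can be sketched in a few lines of NumPy. This is not the authors' implementation: the function name `lipsum_regularizer`, the argument names, and the averaging over a batch of images are assumptions made for this sketch. The text features stand in for $\mathcal{G}_{\boldsymbol{\phi}}(\boldsymbol{t}_m)$ and the image features for $\mathcal{F}_{\boldsymbol{\theta}}(\boldsymbol{x})$ and $\mathcal{F}_{\boldsymbol{\theta}_0}(\boldsymbol{x})$. Any feature normalization or temperature scaling is left out, and random arrays replace real encoder outputs.

```python
import numpy as np


def lipsum_regularizer(img_feat_ft, img_feat_zs, text_feat):
    """Squared gap between fine-tuned and zero-shot image-text inner products.

    img_feat_ft: (B, D) image features from the fine-tuned encoder (F_theta).
    img_feat_zs: (B, D) image features from the frozen zero-shot encoder (F_theta_0).
    text_feat:   (M, D) text features of M random texts (G_phi(t_m)).
    Returns the per-image value ||v_ft - v_zs||^2 / (2M), averaged over the batch
    (the batch average is an assumption of this sketch).
    """
    v_ft = img_feat_ft @ text_feat.T  # (B, M): entry m is <G_phi(t_m), F_theta(x)>
    v_zs = img_feat_zs @ text_feat.T  # (B, M): same with the zero-shot encoder
    num_texts = text_feat.shape[0]    # M
    per_image = np.sum((v_ft - v_zs) ** 2, axis=1) / (2 * num_texts)
    return float(np.mean(per_image))


# Hypothetical stand-in features; real use would take them from the encoders.
rng = np.random.default_rng(0)
B, D, M = 4, 8, 5
text_feat = rng.normal(size=(M, D))
zs_feat = rng.normal(size=(B, D))
ft_feat = zs_feat + 0.1 * rng.normal(size=(B, D))  # slightly drifted features

reg_same = lipsum_regularizer(zs_feat, zs_feat, text_feat)   # no drift, no penalty
reg_shift = lipsum_regularizer(ft_feat, zs_feat, text_feat)  # drift is penalized
```

In a training loop, a term like this would be added to the usual cross-entropy loss with some weighting coefficient, and only the fine-tuned image encoder would receive gradients from it. Those details are outside this sketch.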
Abstract
Large-scale contrastive vision-language pre-trained models provide zero-shot models that achieve competitive performance across a range of image classification tasks without requiring training on downstream data. Recent works have confirmed that, while additional fine-tuning of the zero-shot model on the reference data improves downstream performance, it compromises the model's robustness against distribution shifts. Our investigation begins by examining the conditions required to achieve the goals of robust fine-tuning, using descriptions based on feature distortion theory and joint energy-based models. We then propose a novel robust fine-tuning algorithm, Lipsum-FT, that effectively utilizes the language modeling aspect of vision-language pre-trained models. Extensive experiments on distribution shift scenarios in DomainNet and ImageNet confirm the superiority of the proposed Lipsum-FT approach over existing robust fine-tuning methods.
