Towards Compatible Fine-tuning for Vision-Language Model Updates

Zhengbo Wang; Jian Liang; Lijun Sheng; Ran He; Zilei Wang; Tieniu Tan

TL;DR

This work tackles the problem that efficient fine-tuning modules for vision-language models often lose effectiveness when the base model is upgraded. It introduces ContCoOp, a shallow-layer method that builds class-conditioned prompts at the input of the text encoder, using an attention mechanism to fuse class information into the prompts so that they adapt to embedding shifts caused by upgrades. The method optimizes a joint loss $L = L_{ce} + \lambda L_{kd}$, where $L_{ce}$ is the cross-entropy classification loss and $L_{kd}$ distills zero-shot knowledge to improve transferability. Experiments across 15 datasets show that ContCoOp achieves superior compatibility with upgraded models (e.g., EVA-CLIP) and strong out-of-distribution generalization, outperforming baselines such as CoOp, CoCoOp, and KgCoOp across CLIP architectures including ViT-B/16 and ViT-L/14. The approach reduces retraining costs during model upgrades and suggests that compatibility considerations could be extended to other domains, such as NLP.
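The joint loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the summary does not specify the exact form of $L_{kd}$, so the sketch assumes a KL divergence between the zero-shot (teacher) and prompt-tuned (student) class distributions; the function names and the default `lam` value are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """L_ce: negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def kd_loss(student_logits, teacher_logits):
    """L_kd (ASSUMED form): KL(teacher || student) over class probabilities.

    The teacher stands in for the zero-shot model; the paper's actual
    distillation term may differ from this choice.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(student_logits, teacher_logits, label, lam=1.0):
    """L = L_ce + lam * L_kd, with lam playing the role of lambda."""
    return cross_entropy(student_logits, label) + lam * kd_loss(
        student_logits, teacher_logits
    )
```

With `lam=0.0` the sketch reduces to plain cross-entropy training; larger values pull the tuned predictions toward the zero-shot ones.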

Abstract

Efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, are these plug-and-play modules still effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on CLIP in terms of their compatibility with model updates. The study reveals that many high-performing fine-tuning methods are not compatible with the upgraded models. To address this, we propose a novel approach, Class-conditioned Context Optimization (ContCoOp), which integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Consequently, the prompts can dynamically adapt to changes in the embedding space caused by model updates, ensuring continued effectiveness. Extensive experiments over 15 datasets show that ContCoOp achieves the highest compatibility among the compared baseline methods and exhibits robust out-of-distribution generalization.
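To make the class-conditioning idea concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes single-head scaled dot-product attention with the learnable prompts as queries and the class-name token embeddings as keys and values; the residual connection, the absence of learned projection matrices, the function name `condition_prompts`, and all numeric values are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def condition_prompts(prompts, class_tokens):
    """Fuse class information into each prompt vector via attention.

    Queries are the prompt vectors; keys and values are the class token
    embeddings. The residual add and the lack of learned projections are
    simplifying ASSUMPTIONS, not details taken from the paper.
    """
    d = len(prompts[0])
    conditioned = []
    for q in prompts:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in class_tokens])
        attended = [
            sum(w * v[j] for w, v in zip(weights, class_tokens))
            for j in range(d)
        ]
        conditioned.append([qj + aj for qj, aj in zip(q, attended)])
    return conditioned

# Made-up numbers: the same prompts paired with class embeddings from a
# base model and from a (simulated) upgraded model.
prompts = [[0.1, 0.2, 0.0, -0.1], [0.0, -0.3, 0.2, 0.1]]
old_class = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
new_class = [[0.8, 0.1, 0.6, 0.1], [0.1, 0.9, 0.1, 0.4]]

old_ctx = condition_prompts(prompts, old_class)
new_ctx = condition_prompts(prompts, new_class)
```

Because the conditioned prompts are a function of the class embeddings, `new_ctx` differs from `old_ctx` even though the stored prompt parameters are unchanged, which is the adaptation behavior the abstract describes.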

Paper Structure (12 sections, 4 equations, 4 figures, 8 tables)

Introduction
Related Work
Method
Preliminary Study
Class-conditioned Context Optimization
Experiments
Setup
Main Results
Out-of-distribution Generalization
Different Architecture
Ablation Study
Conclusion

Figures (4)

Figure 1: Efficient fine-tuning methods enable us to easily train plug-and-play modules that enhance the performance of foundation models. However, a significant challenge arises from the frequent updates to foundation models, such as the variants of Llama and CLIP. Re-training these modules for the upgraded model incurs significant costs. Therefore, our investigation addresses the question of whether these modules remain compatible with the upgraded model.
Figure 2: (a) The average performance over 11 datasets. We assess the compatibility of efficient fine-tuning methods for VLMs under model upgrades. These methods are trained on ViT-B/16-based CLIP and then integrated into the corresponding EVA-CLIP, the upgraded model. The terms Base and New represent the performance tested after inserting these modules into CLIP and EVA-CLIP, respectively, with H indicating their harmonic average. $\dagger$ denotes a deep-layer method. (b) The average absolute and relative changes in parameters at each layer of the text encoder before and after model upgrading. (c) The average absolute and relative changes in output features at each layer of the text encoder before and after model upgrading. (d) To mitigate methodological impacts, we train CoOp at different layers on the DTD dataset and report its accuracy on the upgraded model. The results indicate that shallower layers exhibit superior transferability compared to deeper layers.
Figure 3: The overview of our method. To enhance the compatibility, we aim for our modules to dynamically adapt to model updates. For this, we adopt class-conditioned learnable prompts. Leveraging the attention network, our method integrates class information into learnable prompts. Moreover, following model updates, the prompts also undergo automatic updates, synchronized with changes in class embeddings. We include CoOp and CoCoOp for comparison.
Figure 4: The impact of different context lengths.
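The H score in the Figure 2 caption is the harmonic average of the Base and New accuracies. A small sketch of that computation follows; the function name is hypothetical and the accuracy values in the comment are made up for illustration, not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic average H = 2 * Base * New / (Base + New).

    Returns 0.0 if either accuracy is non-positive, since the harmonic
    mean is undefined or degenerate in that case.
    """
    if base_acc <= 0 or new_acc <= 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Illustrative (made-up) accuracies: strong on the base model, weak
# after being plugged into the upgraded model.
h = harmonic_mean(60.0, 30.0)  # 40.0
```

The harmonic average penalizes imbalance: a module that scores well on the base model but poorly on the upgraded one receives a lower H than its arithmetic average would suggest.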
