RESTORE: Towards Feature Shift for Vision-Language Prompt Learning

Yuncheng Yang, Chuyan Zhang, Zuopeng Yang, Yuting Gao, Yulei Qin, Ke Li, Xing Sun, Jie Yang, Yun Gu

TL;DR

This work tackles the generalization gap in vision–language prompt tuning by introducing feature shift as a diagnostic and regularization tool for cross-modal alignment. RESTORE couples a feature-shift consistency loss with a dynamic surgery adapter to synchronize updates across vision and language branches while correcting large shifts in output representations. Across 11 datasets in few-shot settings, RESTORE consistently outperforms state-of-the-art baselines in base-to-novel and cross-domain evaluations, demonstrating improved generalization without sacrificing alignment. The findings highlight the importance of coordinated modality updates for robust VLM adaptation and offer practical mechanisms for maintaining pre-training cross-modal constraints during downstream tuning.

Abstract

Prompt learning is effective for fine-tuning foundation models to improve their generalization across a variety of downstream tasks. However, prompts that are independently optimized along a single modality path may sacrifice the vision-language alignment of pre-trained models in return for improved performance on specific tasks and classes, leading to poorer generalization. In this paper, we first demonstrate that prompt tuning along only one branch of CLIP (e.g., language or vision) causes this misalignment. Without proper regularization across the learnable parameters in different modalities, prompt learning violates the original pre-training constraints inherent in the two-tower architecture. To address such misalignment, we first propose feature shift, defined as the variation of embeddings after introducing the learned prompts, to serve as an explanatory tool. We examine its relation to generalizability and then propose RESTORE, a multi-modal prompt learning method that imposes explicit constraints on cross-modal consistency. Specifically, to prevent feature misalignment, a feature shift consistency is introduced to synchronize inter-modal feature shifts by measuring and regularizing the magnitude of their discrepancy during prompt tuning. In addition, we propose a "surgery" block to avoid short-cut hacking, where cross-modal misalignment can still be severe if the feature shift of each modality varies drastically at the same rate. It is implemented as feed-forward adapters on both modalities to alleviate the misalignment problem. Extensive experiments on 15 datasets demonstrate that our method outperforms state-of-the-art prompt tuning methods without compromising feature alignment.
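The abstract defines feature shift as the variation of embeddings after the learned prompts are introduced, and describes a consistency term that regularizes the magnitude of the discrepancy between modalities. The sketch below illustrates one possible reading of that idea in plain Python. The exact loss used by RESTORE is not given in this section, so the L2 norm, the batch averaging, and the absolute-difference penalty are assumptions, and all function names are hypothetical.

```python
import math

def feature_shift(tuned, pretrained):
    """Variation of one embedding after prompts are introduced (tuned minus pre-trained)."""
    return [t - p for t, p in zip(tuned, pretrained)]

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))

def mean_shift_magnitude(tuned_batch, pretrained_batch):
    """Average L2 magnitude of the feature shift over a batch (assumed aggregation)."""
    norms = [l2_norm(feature_shift(t, p))
             for t, p in zip(tuned_batch, pretrained_batch)]
    return sum(norms) / len(norms)

def shift_consistency_penalty(img_tuned, img_pre, txt_tuned, txt_pre):
    """Assumed penalty: absolute gap between image-side and text-side shift magnitudes."""
    image_shift = mean_shift_magnitude(img_tuned, img_pre)
    text_shift = mean_shift_magnitude(txt_tuned, txt_pre)
    return abs(image_shift - text_shift)

if __name__ == "__main__":
    # Toy 2-D embeddings: the image branch is unchanged, the text branch moves by (3, 4).
    img_pre, img_tuned = [[1.0, 0.0]], [[1.0, 0.0]]
    txt_pre, txt_tuned = [[0.0, 1.0]], [[3.0, 5.0]]
    print(shift_consistency_penalty(img_tuned, img_pre, txt_tuned, txt_pre))
```

In this toy case only the text branch shifts, so the penalty equals the text-side shift magnitude. When both branches shift by the same magnitude the penalty is zero, even if both shifts are large, which is the situation the abstract says the "surgery" block is meant to handle.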

Paper Structure (20 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The pre-trained VLM (e.g., CLIP) demonstrates strong generalizability through its zero-shot performance on both base and novel classes. However, existing prompt learning methods over-emphasize performance gains on the seen base classes while ignoring their declining generalization on novel ones, as shown by the decreased probability $P(c|x)$ of the ground-truth category and the lower overall accuracy.
  • Figure 2: The overall workflow of our multi-modal prompt tuning. During fine-tuning, the parameters of the encoder backbones are kept frozen. $K$ textual descriptions are prompted to represent $K$ categories and are encoded by the text encoder into the embedding space. Similarly, $M$ images are encoded by the image encoder into the visual embedding space. Classification is carried out by measuring the similarity between visual and textual representations. In both the vision and language encoders, multiple learnable prompts are inserted to modify the embeddings independently. To establish connections between prompts from different modalities, we take feature shift as a bridge and synchronize the cross-modal representation update. To reduce the risk of task-specific overfitting, the "surgery" block is applied to penalize severe deviation of prompt-tuned features from their pre-trained counterparts, preserving the valuable intrinsic knowledge.
  • Figure 3: A negative association is observed between the inter-modal discrepancy of feature shift and the performance gains. Compared with zero-shot CLIP, existing single-modal and multi-modal prompt tuning methods achieve superior performance on base categories and inferior performance on novel categories. They "unintentionally" encourage the inter-modal discrepancy of feature shift during fine-tuning, which leads to a loss of generalization capability on downstream tasks.
  • Figure 4: Average feature shift and the corresponding performance for different methods. IVLP, VPT, LPT, and IVLP+$\mathcal{L}^{fs}$ denote independent vision-language prompt tuning, vision prompt tuning, language prompt tuning, and IVLP with the feature shift loss, respectively. Prompt tuning in a single branch causes severe feature shifts, leading to feature misalignment and degraded performance. Introducing our feature shift loss reduces this modality misalignment and thereby improves performance.
  • Figure 5: t-SNE visualization of features on base and novel classes for zero-shot CLIP, CoOp, and our method. Differently colored dots represent different categories; smaller dots are image features and larger dots are text features. Zero-shot CLIP performs poorly on the base classes, while the fine-tuned CoOp performs poorly on the novel classes (see the red box in the figure). After introducing cross-modal constraints and an adapter to alleviate feature collapse, our method performs well on both base and novel classes.
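
The Figure 2 caption states that classification is carried out by measuring the similarity between visual and textual representations, and Figure 1 refers to the class probability $P(c|x)$. A minimal sketch of that step follows, assuming cosine similarity and a temperature-scaled softmax as commonly used with CLIP-style models. The function names and the temperature value are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def class_probabilities(image_emb, text_embs, temperature=0.01):
    """Softmax over image-text similarities; temperature=0.01 is an assumed value."""
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # One image embedding scored against K = 2 text embeddings (toy values).
    image = [1.0, 0.0]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    probs = class_probabilities(image, texts)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    print(predicted, probs)
```

Because the prediction depends only on relative similarities between the two embedding spaces, a shift in one modality that is not matched in the other can change $P(c|x)$ even when the encoder backbones stay frozen.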