Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model

Shuailei Ma, Chen-Wei Xie, Ying Wei, Siyang Sun, Jiaqi Fan, Xiaoyi Bao, Yuxin Guo, Yun Zheng

TL;DR

This work probes the mechanisms by which multi-modal prompts adapt pre-trained vision-language models. Through attention statistics, alignment analyses, and visualization across 11 datasets, it shows that prompts primarily act as dataset biases rather than altering feature extraction: text prompts shift the language branch's reliance toward dataset-specific cues, while vision prompts resemble overlooked background features. The authors introduce bias tuning, a method that injects learnable biases directly into transformer blocks and outperforms prompt tuning with the same parameter budget, validating the central role of bias in prompt efficacy. These findings provide a principled view of prompt-based adaptation and suggest directions for designing more robust, bias-aware multi-modal prompts for downstream tasks.

Abstract

Prompt learning has emerged as an efficient alternative to fine-tuning foundation models, such as CLIP, for various downstream tasks. However, no existing work provides a comprehensive explanation of the working mechanism of multi-modal prompts. In this paper, we directly analyze multi-modal prompts by asking two questions: $(i)$ How do the learned multi-modal prompts improve recognition performance? $(ii)$ What do the multi-modal prompts learn? To answer these questions, we begin by isolating the component of the self-attention formula through which the prompt influences the computation at each layer. The prompt does so in two distinct ways, i.e., $(1)$ introducing prompt embeddings makes the $[cls]$ token focus on foreground objects; $(2)$ the prompts learn a bias term during the update of token embeddings, allowing the model to adapt to the target domain. Subsequently, we conduct extensive visualization and statistical experiments on eleven diverse downstream recognition datasets. These experiments reveal that the learned prompts improve performance mainly in the second way: they act as a dataset bias that improves the recognition performance of the pre-trained model on the corresponding dataset. We also propose bias tuning to validate this finding. With a deeper understanding of multi-modal prompts, we hope our work can inspire new and solid research in this direction.
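
The two ways in which a prompt can influence self-attention follow from a standard property of the softmax: when prompt tokens are appended to the keys and values, the output for a query splits into a rescaled copy of the original attention output plus a prompt-only term. The sketch below checks this identity numerically for one query and one head. It is an illustration under stated assumptions, not the authors' code: the function names, the toy dimensions, and the random vectors are all hypothetical, and the prompt keys and values are treated as given vectors rather than projected from learned embeddings.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def prompt_share(query, keys, prompt_keys):
    """Fraction of the query's attention mass that lands on the prompt tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys + prompt_keys])
    return sum(weights[len(keys):])


# Toy setup (assumed sizes): 5 ordinary tokens, 2 prompt tokens, dimension 8.
random.seed(0)
d = 8


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(d)]


query = rand_vec()
keys = [rand_vec() for _ in range(5)]
values = [rand_vec() for _ in range(5)]
prompt_keys = [rand_vec() for _ in range(2)]
prompt_values = [rand_vec() for _ in range(2)]

# Attention over the ordinary tokens and the prompts together.
full = attend(query, keys + prompt_keys, values + prompt_values)

# The same output, rebuilt from two separate pieces.
lam = prompt_share(query, keys, prompt_keys)       # attention mass on prompts
original = attend(query, keys, values)             # attention without prompts
bias = attend(query, prompt_keys, prompt_values)   # prompt-only term
recombined = [(1 - lam) * o + lam * b for o, b in zip(original, bias)]

max_gap = max(abs(f - r) for f, r in zip(full, recombined))
print("max difference:", max_gap)
```

In this view, the factor `1 - lam` rescales attention among the original tokens, while `lam * bias` is an additive term contributed by the prompts alone; the second piece is the kind of term the abstract refers to as a bias.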

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the multi-modal prompts in the pre-trained model. In each layer, the fixed prompts provide an additional attention value for each token. In the first layer, the attention between tokens undergoes a proportional transformation ($\frac{1}{2}\times$). In the subsequent layers, however, the relative values of attention between tokens change. See Sec. 3.2.1 for a detailed explanation.
  • Figure 2: Statistics on the contribution of each token of the language branch to the alignment task for the OxfordPet dataset. We conduct the statistical analysis on samples that the zero-shot CLIP misidentifies but the independent prompt CLIP identifies correctly. The statistics for both models are averaged over the entire dataset, where the blue and red regions represent zero-shot and prompt tuning, respectively. On the left side of the red vertical line, we show the contribution of $\bm{t}_{SOS}$ and the four learned text prompts $\bm{P}_{t}$. On the right side, we show the $\bm{c}_{k}$ tokens and padding tokens. The horizontal axis is the token index and the vertical axis is the relative value of the contribution. For the zero-shot CLIP, we use the template "a photo of [Category].". For the tokens after $\bm{t}_{EOS}$, we set the contribution to 0 for intuitive comparison. In the middle of each figure, we zoom in on the indices at which the prompts are placed. Detailed statistics for the other datasets are given in the Appendix.
  • Figure 3: Statistics on the contribution of each patch of the image branch to the alignment task for the OxfordPet dataset. We conduct the statistical analysis on samples that the zero-shot CLIP misidentifies but the independent prompt CLIP identifies correctly. The statistics for both models are averaged over the entire dataset, where the blue and red regions represent zero-shot and prompt tuning, respectively. On the left side of the red vertical line, we show the contribution of the input patch tokens $\bm{\Tilde{X}_p}$. On the right side, we show the learned vision prompts $\bm{P}_{v}$. The horizontal axis is the token index and the vertical axis is the relative value of the contribution. For the zero-shot CLIP, the values at indices 196 to 199 are empty because there are no vision prompts. In the middle of each figure, we zoom in on the indices at which the prompts are placed. Detailed statistics for the other datasets are given in the Appendix.
  • Figure 4: Statistics results on the $[cls]$ token attention for the ImageNet dataset. The first row is the attention distribution of CLIP, where the prompt denotes 'a photo of a'. The second row is the feature distribution after adding the vision prompts, where the prompt represents the learnable token. Each graph contains the attention distribution of 12 transformer blocks, and each data layer includes 8 attention heads. sos, prompt, category and '.' denote the $[cls]$ token's attention on the corresponding token.
  • Figure 5: The averaged attention distance and entropy in different attention heads (dots). For CLIP, we compute the averaged attention distance and entropy of the original CLIP model. For W/O Prompt, we compute the averaged attention distance and entropy of the CLIP model after adding the prompts, but exclude the attention distance of the prompts. For W Prompt, the attention distance and entropy of the prompts are retained.
  • ...and 1 more figure
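
The attention distance and entropy named in the Figure 5 caption can be illustrated for a single query patch as follows. This is a minimal sketch under assumptions: the captions above do not give the paper's exact definitions, so the code uses the common convention of measuring distance in patch units on a square grid and entropy in nats, and the helper names and the 4×4 toy grid are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def mean_attention_distance(weights, query_index, grid_size):
    """Attention-weighted Euclidean distance, in patch units, from the query
    patch to every key patch on a grid_size x grid_size grid (row-major)."""
    q_row, q_col = divmod(query_index, grid_size)
    total = 0.0
    for key_index, w in enumerate(weights):
        k_row, k_col = divmod(key_index, grid_size)
        total += w * math.hypot(k_row - q_row, k_col - q_col)
    return total


grid = 4                      # assumed toy grid: 4 x 4 = 16 patches
n = grid * grid
query_index = 5

uniform = [1.0 / n] * n       # a head that attends to every patch equally
local = [0.0] * n
local[query_index] = 1.0      # a head that attends only to the query patch

entropy_uniform = attention_entropy(uniform)
entropy_local = attention_entropy(local)
dist_uniform = mean_attention_distance(uniform, query_index, grid)
dist_local = mean_attention_distance(local, query_index, grid)

print("entropy:", entropy_uniform, entropy_local)
print("distance:", dist_uniform, dist_local)
```

A head with low entropy and short distance attends to a few nearby patches, while a head with high entropy and long distance spreads its attention across the image. Averaging these two quantities over queries and images gives one point per head.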