MePT: Multi-Representation Guided Prompt Tuning for Vision-Language Model

Xinyang Wang, Yi Yang, Minfeng Zhu, Kecheng Zheng, Shi Liu, Wei Chen

TL;DR

This work tackles the reliance of prompt tuning on a single global image representation in vision-language models. It introduces MePT, a three-branch framework (global, augmented, and vanilla image representations) coupled with a parameter-efficient self-ensemble to capture diverse visual cues and improve generalization. In experiments on base-to-novel generalization, cross-dataset transfer, and segmentation across 11 datasets, MePT consistently outperforms strong baselines, with the most notable gains on domain-shift tasks where the distribution gap is large. The approach highlights the value of visual prompts in enriching image representations and offers a practical, robust path for adapting VLMs to diverse downstream scenarios.

Abstract

Recent advancements in pre-trained Vision-Language Models (VLMs) have highlighted the significant potential of prompt tuning for adapting these models to a wide range of downstream tasks. However, existing prompt tuning methods typically map an image to a single representation, limiting the model's ability to capture the diverse ways an image can be described. To address this limitation, we investigate the impact of visual prompts on the model's generalization capability and introduce a novel method termed Multi-Representation Guided Prompt Tuning (MePT). Specifically, MePT employs a three-branch framework that focuses on diverse salient regions, uncovering the inherent knowledge within images which is crucial for robust generalization. Further, we employ efficient self-ensemble techniques to integrate these versatile image representations, allowing MePT to learn all conditional, marginal, and fine-grained distributions effectively. We validate the effectiveness of MePT through extensive experiments, demonstrating significant improvements on both base-to-novel class prediction and domain generalization tasks.

Paper Structure

This paper contains 41 sections, 10 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Illustration of multi-representation prompting. Given an image, we visualize the attention maps from the last layer of the vision transformer (Right). After visual-side prompt tuning (a), CLIP yields a stronger relevance signal for foreground objects than vanilla CLIP (b). The visual prompt tokens ([VP]s) are likewise foreground-focused and naturally attend to diverse objects in the scene (c). We propose a novel Multi-Representation Guided Prompt Tuning framework, designed to capture the comprehensive information inherent in the image with three branches.
  • Figure 2: Overview of our proposed MePT framework for multi-modal prompt tuning. CLIP encoders are utilized to generate image and text representations from the input image-text pairs. We introduce three image representations, one per branch: global representation $\mathbf{x_p}$, augmented representation $\mathbf{\tilde{x}_p}$, and vanilla representation $\mathbf{x}$, to ensure comprehensive visual understanding across different domains. Additionally, we employ ground truth supervision ($\mathcal{L}_{Aug}$) in the augmented branch and constraints ($\mathcal{L}_{img}$ and $\mathcal{L}_{text}$) in the global branch to align with general knowledge. Moreover, self-masked attention within the visual prompts is employed to restrict the attention flow.
  • Figure 3: Comparison of foreground/background segmentation results for the special class token ([CLS]) and the visual prompt tokens ([VP]s) against the zero-shot CLIP baseline. The visual prompts are trained on ImageNet with 16 shots per base class.
  • Figure 4: Ablation study on prompt depth (left) and visual prompt length (right) over 11 datasets.
  • Figure 5: Comparison of foreground/background segmentation results between GradCAM and the raw attention map, using the output embeddings derived from the special class token ([CLS]), relative to the zero-shot CLIP baseline. The visual prompts are trained on ImageNet with 16 shots per base class.
  • ...and 1 more figure
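
To make the three-branch idea in the Figure 2 caption more concrete, the following is a minimal sketch of how predictions from the global ($\mathbf{x_p}$), augmented ($\mathbf{\tilde{x}_p}$), and vanilla ($\mathbf{x}$) image representations could be combined at inference time. It is not the authors' implementation: the uniform averaging of per-branch class probabilities, the logit scale of 100, the toy feature vectors, and all function names (`cosine_logits`, `ensemble_probs`, and so on) are assumptions made for illustration, and the self-ensemble described in the paper may work differently.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_feat, text_feats, scale=100.0):
    """CLIP-style logits: scaled cosine similarity to each class text feature.

    The scale value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_feat)
    return [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_feats
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(branch_feats, text_feats, weights=None):
    """Weighted average of per-branch class probabilities.

    Uniform weights are used by default; this weighting is hypothetical.
    """
    if weights is None:
        weights = [1.0 / len(branch_feats)] * len(branch_feats)
    per_branch = [softmax(cosine_logits(f, text_feats)) for f in branch_feats]
    return [
        sum(w * p[c] for w, p in zip(weights, per_branch))
        for c in range(len(text_feats))
    ]


# Toy 3-dimensional features standing in for encoder outputs (made up).
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two class prompts
x_p = [0.9, 0.1, 0.0]    # global representation
x_aug = [0.8, 0.3, 0.1]  # augmented representation
x = [0.6, 0.5, 0.2]      # vanilla representation

probs = ensemble_probs([x_p, x_aug, x], text_feats)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Averaging probabilities rather than raw features keeps each branch's prediction interpretable on its own, which is convenient for inspecting how much each representation contributes; other combination rules are equally plausible.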