Progressive Multi-modal Conditional Prompt Tuning

Xiaoyu Qiu; Hao Feng; Yuechen Wang; Wengang Zhou; Houqiang Li

TL;DR

ProMPT addresses the gap between vision and language representations in pre-trained vision-language models with a Progressive Multi-modal Conditional Prompt Tuning framework. It pairs an initialization phase with a multi-modal iterative evolution module that injects class-conditional vision prompts and instance-conditional text prompts, guided by feature filtering, so that V-L features are aligned progressively and predictions are refined from coarse to precise. Compared with uni-modal prompting baselines, the approach improves base-to-novel generalization, cross-dataset transfer, and domain robustness on 11 datasets, with consistent gains in novel-class accuracy and harmonic mean; ablations support the necessity of each component. The CLIP backbone stays frozen and only the prompts and generators are learned, which keeps adaptation efficient. Code is released for reproducibility.

Abstract

Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting, which leverages VLMs as knowledge bases to extract information beneficial for downstream tasks. However, existing methods primarily employ uni-modal prompting, which engages only one modality branch and therefore cannot adjust vision-language (V-L) features simultaneously. Additionally, the one-pass forward pipeline in VLM encoding struggles to align V-L features that are separated by a large gap. To address these challenges, we propose a novel method, Progressive Multi-modal Conditional Prompt Tuning (ProMPT). ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing the image and the current encoding information. It comprises an initialization and a multi-modal iterative evolution (MIE) module. Initialization encodes images and text using a VLM, followed by a feature filter that selects the text features most similar to the image features. MIE then facilitates multi-modal prompting through class-conditional vision prompting, instance-conditional text prompting, and feature filtering. In each MIE iteration, vision prompts are obtained from the filtered text features via a vision generator, encouraging the image features to focus more on the target object during vision prompting. The encoded image features are then fed into a text generator to produce text prompts that are more robust to class shifts. Thus, V-L features are progressively aligned, enabling predictions to advance from coarse to precise. Extensive experiments are conducted in three settings to evaluate the efficacy of ProMPT. The results indicate that ProMPT outperforms existing methods on average across all settings, demonstrating its superior generalization and robustness. Code is available at https://github.com/qiuxiaoyu9954/ProMPT.
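The recurrent structure described in the abstract can be sketched as a short control-flow loop. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`feature_filter`, `progressive_loop`), the toy linear stand-ins for the CLIP encoders and the two generators, and the use of raw cosine similarity as the per-iteration class score are all hypothetical choices made only to show the order of operations.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale vectors along the last axis to (approximately) unit length.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def feature_filter(x, Z, a):
    # Cosine similarity between one image feature x (d,) and all text features Z (C, d),
    # followed by selection of the a most similar text features.
    sims = l2_normalize(Z) @ l2_normalize(x)
    top = np.argsort(-sims)[:a]
    return top, Z[top], sims

def progressive_loop(x0, Z0, vision_gen, text_gen, encode_image, encode_text, a=2, n_iters=3):
    # Initialization: features from the plain encoders, then the first filtering step.
    _, Z_top, sims = feature_filter(x0, Z0, a)
    history = [sims]
    for _ in range(n_iters):
        vision_prompts = vision_gen(Z_top)   # class-conditional: built from filtered text features
        x = encode_image(vision_prompts)     # re-encode the image with vision prompts
        text_prompts = text_gen(x)           # instance-conditional: built from image features
        Z = encode_text(text_prompts)        # re-encode the class texts with text prompts
        _, Z_top, sims = feature_filter(x, Z, a)
        history.append(sims)
    return history

# Toy stand-ins (assumptions): small random linear maps replace the real generators,
# and additive shifts replace the frozen CLIP encoders.
rng = np.random.default_rng(0)
d, C = 8, 5
x0, Z0 = rng.normal(size=d), rng.normal(size=(C, d))
W_v, W_t = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
vision_gen = lambda Z_top: Z_top.mean(axis=0) @ W_v
encode_image = lambda prompts: x0 + prompts
text_gen = lambda x: x @ W_t
encode_text = lambda prompts: Z0 + prompts

history = progressive_loop(x0, Z0, vision_gen, text_gen, encode_image, encode_text)
```

Each entry of `history` holds one similarity score per class, so comparing successive entries shows how the class ranking changes from the initialization through each iteration.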

Paper Structure (15 sections, 14 equations, 8 figures, 5 tables)

Introduction
RELATED WORKS
METHODOLOGY
Review of CLIP
Framework Overview
Initialization
Multi-modal Iterative Evolution
Training Objective
EXPERIMENTS
Experimental Setup
Generalization from Base-to-Novel Classes
Cross-dataset Evaluation
Domain Generalization
Ablation Study
CONCLUSION

Figures (8)

Figure 1: An illustration of our proposed method. ProMPT progressively refines classification confidence, correcting the prediction from "cat" to "dog" over successive iterations.
Figure 2: An overview of our ProMPT framework, which adopts an iterative strategy. It comprises an initialization and a multi-modal iterative evolution (MIE) module, aiming to progressively refine the predictions from coarse to precise. Initialization contains CLIP and introduces a feature filter to select the top-$a$ text features most similar to the image features. Each iteration in MIE involves class-conditional vision prompting, instance-conditional text prompting, and feature filtering. The top-$a$ features are fed into the vision generator to produce vision prompts, and the encoded image features are then fed into the text generator to obtain text prompts. Overall, ProMPT is optimized by minimizing the cumulative CE loss of the classification in MIE.
Figure 3: The implementation process of the feature filter ${\mathcal{F}}$. In the $n$-th iteration, the feature filter takes image features ${x^n}$ and text features ${Z^n}$ as input, calculates their cosine similarity, and selects the top-$a$ text features ${\widetilde{Z}}^{n}$ based on the similarity. ${\widetilde{Z}}^{n}$ then serves as the input for the $(n+1)$-th iteration.
Figure 4: Class-conditional vision prompting and instance-conditional text prompting at the $n$-th iteration.
Figure 5: Comparison of ProMPT and CoCoOp on new and base classes. ProMPT outperforms CoCoOp on most benchmarks; the few marginal declines are small relative to the considerable gains elsewhere.
...and 3 more figures
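
The Figure 2 caption states that ProMPT is optimized by minimizing the cumulative CE loss of the classification in MIE. A minimal NumPy sketch of such an objective follows. It assumes an unweighted sum of per-iteration cross-entropy terms; whether the paper weights iterations or includes the initialization stage is not stated here, and the names `cross_entropy` and `cumulative_ce` are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for one sample.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def cumulative_ce(logits_per_iter, label):
    # Assumption: every iteration contributes its CE term with equal weight.
    return sum(cross_entropy(logits, label) for logits in logits_per_iter)

# Two iterations over four classes with true class 1: the second iteration is
# more confident in the correct class, so it contributes a smaller loss term.
coarse = np.array([1.0, 1.0, 0.0, 0.0])
refined = np.array([0.0, 4.0, 0.0, 0.0])
loss = cumulative_ce([coarse, refined], label=1)
```

Summing over iterations means the gradient reaches the prompts and generators through every refinement step rather than only the last one.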