Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models

Xinyang Liu, Dongsheng Wang, Bowei Fang, Miaoge Li, Zhibin Duan, Yishi Xu, Bo Chen, Mingyuan Zhou

TL;DR

This work tackles the challenge of prompting vision-language models with diverse, label-specific concepts while avoiding overfitting. It introduces PBPrompt, a Bayesian framework that generates stochastic, label-conditioned prompts via a hierarchical generator and couples them to visual patches through a bidirectional conditional transport (CT) regularizer. The method optimizes a combined evidence lower bound (ELBO) that includes a contextual prior and the CT regularizer, supporting few-shot learning, base-to-new generalization, and cross-domain transfer. Empirical results across 15 datasets and four tasks show improvements over strong baselines in most settings, and the learned visual and textual prompts can be inspected through transport plans and captioning systems. The approach offers a semantically grounded alternative to deterministic prompt tuning, with practical benefits for generalization in vision-language tasks.
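
To make the bidirectional CT idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes that each point spreads its mass over the other set with softmax weights on cosine similarity, and that the cost of a move is one minus cosine similarity. The function names (`directed_cost`, `bidirectional_ct`) and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def directed_cost(sources, targets):
    """Average expected cost of moving each source point onto the targets.

    Assumption: the transport plan for one source is a softmax over its
    cosine similarities to the targets, and the cost is 1 - similarity.
    """
    total = 0.0
    for s in sources:
        sims = [cosine(s, t) for t in targets]
        plan = softmax(sims)
        total += sum(p * (1.0 - sim) for p, sim in zip(plan, sims))
    return total / len(sources)

def bidirectional_ct(patches, prompts):
    """Sum of the patch-to-prompt and prompt-to-patch directed costs."""
    return directed_cost(patches, prompts) + directed_cost(prompts, patches)

# Toy 2-D embeddings (made up for illustration).
patches = [[1.0, 0.0], [0.0, 1.0]]
aligned_prompts = [[1.0, 0.0], [0.0, 1.0]]
opposed_prompts = [[-1.0, 0.0], [0.0, -1.0]]

aligned = bidirectional_ct(patches, aligned_prompts)
opposed = bidirectional_ct(patches, opposed_prompts)
```

Under these assumptions, prompts that point in the same directions as the patches yield a smaller value than prompts that point away from them, which is the behavior a regularizer of this kind is meant to reward.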

Abstract

For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing works on prompt engineering, which either require laborious manual design or treat prompt tuning as a point-estimation problem, may fail to describe the diverse characteristics of categories, which limits their applicability. We introduce a Bayesian probabilistic approach to prompt tuning, in which label-specific stochastic prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then passing it through a lightweight generative model. Importantly, we semantically regularize the tuning process by minimizing the statistical distance between the visual patches and linguistic prompts, which pushes the stochastic label representations to faithfully capture diverse visual concepts instead of overfitting the training categories. We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts. Extensive results over 15 datasets show promising transferability and generalization performance of our proposed model, both quantitatively and qualitatively.
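
The two-stage generation described above (sample a latent vector, then map it to a prompt) can be sketched as follows. This is only an illustration under stated assumptions. The latent is assumed to be Gaussian and drawn with the reparameterization trick, and the generator is a hypothetical stand-in that adds the latent to each context token, not the generative model used in the paper. All names and values are made up.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Reparameterized draw: r = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def generate_prompt(latent, context_tokens):
    """Hypothetical lightweight generator: shift every context token by the latent."""
    return [[c + r for c, r in zip(token, latent)] for token in context_tokens]

def sample_prompts(mean, log_var, context_tokens, num_samples, seed=0):
    """Draw several stochastic prompts for one label."""
    rng = random.Random(seed)
    return [generate_prompt(sample_latent(mean, log_var, rng), context_tokens)
            for _ in range(num_samples)]

# Toy label-specific distribution and two 2-D context tokens (made up).
label_mean = [0.0, 1.0]
label_log_var = [0.0, 0.0]
context = [[1.0, 1.0], [2.0, 2.0]]

prompts = sample_prompts(label_mean, label_log_var, context, num_samples=3)
```

Each call returns `num_samples` different prompts for the same label, which is what distinguishes this setup from a point estimate that yields a single prompt per label.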

Paper Structure

This paper contains 38 sections, 12 equations, 10 figures, 19 tables, and 1 algorithm.

Figures (10)

  • Figure 1: The motivation of the proposed model. Multiple prompts are generated from the label-specific distributions.
  • Figure 2: Overview of the proposed PBPrompt. PBPrompt generates the stochastic prompts by first sampling a label-specific vector $\boldsymbol{r}_c$ and then applying a single-layer self-attention generator. The CT distance is computed between the textual prompts and image patches to regularize the prompts with visual knowledge.
  • Figure 3: The few-shot learning results on 11 datasets. We compare our PBPrompt with CoOp, CoCoOp, and PLOT. Overall, our proposed model outperforms the baselines in most cases. More numerical results can be found in Table \ref{tab: vit_fsl} and Table \ref{tab: rn50_fsl}.
  • Figure 4: Performance comparison on base-to-new generalization, evaluated by harmonic mean. More results can be found in Table \ref{tab: vit_b2n} and Table \ref{tab: rn50_b2n}.
  • Figure 5: Monte Carlo sampling numbers
  • ...and 5 more figures
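
Figure 4 reports base-to-new generalization by harmonic mean. As a small worked example, the sketch below assumes the usual definition, the harmonic mean of the accuracy on base classes and the accuracy on new classes. The accuracy values are made up for illustration and are not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy.

    Returns 0.0 if either accuracy is not positive, since the harmonic
    mean is dominated by the weaker of the two scores.
    """
    if base_acc <= 0 or new_acc <= 0:
        return 0.0
    return 2.0 * base_acc * new_acc / (base_acc + new_acc)

# Hypothetical accuracies in percent.
balanced = harmonic_mean(50.0, 50.0)
skewed = harmonic_mean(80.0, 60.0)
```

Because the harmonic mean never exceeds the arithmetic mean, a model that gains on base classes at the expense of new classes is penalized, which is why the metric suits this task.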