VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Silin Cheng, Kai Han

TL;DR

VaMP tackles the challenge of adapting vision-language models under limited supervision by introducing sample-specific, uncertainty-aware multi-modal prompts. It treats text prompts as latent variables inferred per input across multiple layers and regularizes them with a class-aware prior derived from class prototypes. The framework achieves state-of-the-art performance on base-to-novel generalization, domain generalization, and cross-dataset transfer in 16-shot settings while remaining parameter-efficient. These results demonstrate the value of modeling both instance-level uncertainty and global task structure in prompt-based multi-modal adaptation.

Abstract

Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
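The two variational ingredients named in the abstract, reparameterized sampling and a KL term between posterior and prior, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes diagonal Gaussian distributions (the abstract does not state the family), and all names, shapes, and random inputs are hypothetical stand-ins for the outputs of the posterior and prior networks.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps z a deterministic function of
    (mu, log_var), which is what lets gradients pass through sampling
    in an autodiff framework.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(log_var_q)
    var_p = np.exp(log_var_p)
    per_dim = 0.5 * (
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
    return per_dim.sum(axis=-1)


rng = np.random.default_rng(0)
batch, dim = 4, 8  # hypothetical batch size and latent prompt dimension

# Assumption: stand-ins for the image-conditioned posterior parameters.
mu_q = rng.standard_normal((batch, dim))
log_var_q = 0.1 * rng.standard_normal((batch, dim))

# Assumption: stand-ins for the class-aware prior parameters.
mu_p = rng.standard_normal((batch, dim))
log_var_p = np.zeros((batch, dim))

z = reparameterize(mu_q, log_var_q, rng)                   # one latent prompt per sample
kl = kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)   # one KL value per sample
kl_self = kl_diag_gaussians(mu_q, log_var_q, mu_q, log_var_q)  # zero when q equals p
```

In a training loop the per-sample KL would typically be averaged and added to the task loss with some weight; the abstract does not say how that term is weighted.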

Paper Structure

This paper contains 28 sections, 27 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Overview of the VaMP framework. (a) Class-Aware Prior Construction: Utilizing CLIP's frozen image encoder to process training samples, generating offline class prototypes for subsequent adaptation. (b) Variational Multi-Modal Prompt Adaptation (VMPA): Variational modeling mechanism where image-conditioned posterior $q_\phi(z_{i} \mid x)$ and class prototype-based prior $p_\psi(z_{i} \mid c_y)$ are aligned through KL divergence regularization of latent prompt distributions. (c) Training Pipeline: Full architecture of our proposed VaMP framework.
  • Figure 2: Qualitative analysis. Layer-wise visualization of aggregated posterior mean distributions $q_\phi(\mathbf{z}_i \mid x)$ for sample images from (a) ImageNet and (b) Flowers102.
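
The offline class prototypes mentioned in the Figure 1(a) caption could be computed along the following lines. This is a hedged sketch: the caption only says that prototypes are generated offline from the frozen image encoder's outputs on training samples. Averaging L2-normalized features per class is an assumption made here, and the function name and toy feature array are hypothetical stand-ins for real CLIP image features.

```python
import numpy as np


def class_prototypes(features, labels, num_classes):
    """Return one prototype per class, shape (num_classes, dim).

    Assumption: a prototype is the mean of the L2-normalized features
    of that class. Classes with no samples keep an all-zero row.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(axis=0)
    return protos


# Toy stand-in for frozen-encoder features: three samples, two dimensions.
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])

protos = class_prototypes(features, labels, num_classes=3)
```

Because the encoder is frozen, this step needs to run only once before training, and the resulting array can be cached and looked up by label when building the prior.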