Bayesian Prompt Learning for Image-Language Model Generalization

Mohammad Mahdi Derakhshani; Enrique Sanchez; Adrian Bulat; Victor Guilherme Turrisi da Costa; Cees G. M. Snoek; Georgios Tzimiropoulos; Brais Martinez

TL;DR

The paper reframes prompt learning for image–language models as a Bayesian variational problem, modeling prompts as latent variables and learning a posterior over prompts to regularize the prompt space. It develops both conditional (image-conditioned residual prompts) and unconditional (global residual prompts) Bayesian prompt learning, deriving ELBO-based objectives and test-time sampling strategies to generate diverse, informative prompts. Across 15 benchmarks, the approach yields improved generalization to unseen prompts and robustness to domain shifts, while maintaining competitive in-domain performance. The results demonstrate that sampling-based prompt ensembles guided by a variational posterior can prevent overfitting to seen prompts and exploit transferable invariant features, with strong empirical gains and a public code release for reproduction.

Abstract

Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning
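
As a rough illustration of the variational formulation described above, the following NumPy sketch treats a residual prompt as a Gaussian latent variable, draws reparameterized samples, and combines a Monte Carlo log-likelihood with a KL penalty into a negative ELBO. It is a minimal sketch under stated assumptions, not the authors' implementation: the standard normal prior, the toy "text encoder" (adding the prompt to a class embedding and normalizing), the temperature, the `beta` weight, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions.
    The N(0, I) prior is an assumption of this sketch."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)


def sample_prompts(base_prompt, mu, log_var, num_samples, rng):
    """Reparameterized samples: prompt = base + mu + sigma * eps."""
    eps = rng.standard_normal((num_samples,) + mu.shape)
    return base_prompt + mu + np.exp(0.5 * log_var) * eps


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def negative_elbo(image_feat, label, class_embeds, base_prompt, mu, log_var,
                  num_samples=8, beta=1.0, temperature=0.01, rng=None):
    """Monte Carlo estimate of -ELBO for one labeled image."""
    if rng is None:
        rng = np.random.default_rng(0)
    prompts = sample_prompts(base_prompt, mu, log_var, num_samples, rng)  # (S, D)
    # Toy stand-in for a frozen text encoder (hypothetical): add each sampled
    # prompt to every class embedding, then L2-normalize.
    text = class_embeds[None, :, :] + prompts[:, None, :]                 # (S, C, D)
    text = text / np.linalg.norm(text, axis=-1, keepdims=True)
    img = image_feat / np.linalg.norm(image_feat)
    logits = text @ img / temperature                                     # (S, C)
    # Average the log-likelihood of the true class over prompt samples.
    log_lik = log_softmax(logits)[:, label].mean()
    return -log_lik + beta * kl_to_standard_normal(mu, log_var)


# Example call with random placeholder features (3 classes, 16 dimensions).
rng = np.random.default_rng(42)
dim, num_classes = 16, 3
loss = negative_elbo(
    image_feat=rng.standard_normal(dim),
    label=1,
    class_embeds=rng.standard_normal((num_classes, dim)),
    base_prompt=np.zeros(dim),
    mu=np.zeros(dim),
    log_var=np.zeros(dim),
    rng=rng,
)
print(float(loss))
```

In a conditional variant, `mu` and `log_var` would be produced from the image features by a small network rather than held as free parameters. At test time, logits from several sampled prompts could be averaged to form an ensemble prediction.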

Paper Structure (13 sections, 8 equations, 5 figures, 12 tables)

This paper contains 13 sections, 8 equations, 5 figures, 12 tables.

Introduction
Related Work
Method
Background
Conditional Bayesian Prompt Learning
Unconditional Bayesian Prompt Learning
Experiments and Results
Experimental Setup
Comparisons
Ablations
Conclusion
Hyperparameters
More Ablations

Figures (5)

Figure 1: We present a Bayesian perspective on prompt learning by formulating it as a variational inference problem (right column). Our framework models the prompt space as an a priori distribution which makes our proposal compatible with common prompt learning approaches that are unconditional (top) or conditional on the image (bottom).
Figure 2: Variational distribution interpretation on the EuroSAT dataset. The text encoding of the mean prompt $\mathbf{p}_{\mu(\mathbf{x})}$ is the most similar to the image encoding. As we move further away from the mean prompt, the cosine similarity scores between the text encoding and image encoding decrease further. When we ensemble the text encoding of different prompts the cosine similarity increases, where the maximum similarity is obtained when all text encodings are combined.
Figure 3: Factor of variation analysis on Flowers102 for two different classes. Left: we plot prompt samples across five clusters. Right: we show the top-3 most representative samples within each cluster (i.e., the 3 images closest to the centroid). The contours of the five clusters intersect in one region, indicating shared knowledge about the corresponding class, and diverge slightly elsewhere, with each divergence indicating a particular factor of variation.
Figure 4: Variational distribution interpretation on the UCF101 dataset. The text encoding of the mean prompt $\mathbf{p}_{\mu(\mathbf{x})}$ is the most similar to the image encoding. As we move further away from the mean prompt, the cosine similarity scores between the text encoding and image encoding decrease further. When we ensemble the text encoding of different prompts the cosine similarity increases, where the maximum similarity is obtained when all text encodings are combined.
Figure 5: Ablation of different vision encoder backbones with respect to unseen prompt generalization. A more over-parameterized model leads to better generalization performance across all datasets.
