Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior

Youngjae Cho; HeeSun Bae; Seungjae Shin; Yeo Dong Youn; Weonyoung Joo; Il-Chul Moon


TL;DR

The paper tackles overfitting and poor adaptability in prompt learning for frozen Vision-Language Pretrained models under few-shot and distribution-shift scenarios. It proposes Adaptive Particle-based Prompt Learning (APP), a Bayesian framework that maintains a multimodal posterior over prompts by employing a data-dependent prior and estimating the posterior via Wasserstein Gradient Flow with Stein Variational Gradient Descent. It further extends adaptation to unseen data through test-data-dependent priors and trains the prior network by maximizing a mutual-information bound with image features. Empirically, APP yields consistent gains across 11 datasets for few-shot classification and improves domain-generalization performance on ImageNet, while qualitative analyses show prompts capturing diverse image-feature modes; code is provided for reproducibility.

Abstract

Recent Vision-Language Pretrained (VLP) models have become the backbone for many downstream tasks, but they are typically used as frozen models without further learning. Prompt learning improves a pre-trained VLP model by adding a learnable context vector to the inputs of the text encoder. In few-shot downstream tasks, maximum likelihood estimation (MLE) training can lead the context vector to overfit dominant image features in the training data. This overfitting can harm generalization, especially under a distribution shift between the training and test datasets. This paper presents a Bayesian framework for prompt learning that alleviates overfitting in few-shot applications and increases the adaptability of prompts to unseen instances. Specifically, modeling a data-dependent prior enhances the adaptability of text features for both seen and unseen image features without a performance trade-off between them. Based on the Bayesian framework, we use Wasserstein Gradient Flow to estimate the target posterior distribution, which makes the prompts flexible enough to capture the complex modes of image features. We demonstrate the effectiveness of our method on benchmark datasets across several experiments, showing statistically significant performance improvements over existing methods. The code is available at https://github.com/youngjae-cho/APP.
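To make the particle-based posterior estimation concrete, the following is a minimal sketch of a generic Stein Variational Gradient Descent (SVGD) update, the technique the TL;DR names for estimating the prompt posterior. It is not the authors' implementation: the RBF kernel, the fixed bandwidth, the step size, the function names `rbf_kernel` and `svgd_step`, and the toy standard-normal target are all assumptions chosen for illustration. In APP the particles would be prompt context vectors and the target would be the prompt posterior with a data-dependent prior.

```python
import numpy as np


def rbf_kernel(particles, bandwidth):
    """Return the RBF kernel matrix and pairwise differences x_i - x_j."""
    diff = particles[:, None, :] - particles[None, :, :]  # shape (n, n, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                  # shape (n, n)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return kernel, diff


def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """Apply one SVGD update to all particles (hypothetical helper)."""
    n = particles.shape[0]
    kernel, diff = rbf_kernel(particles, bandwidth)
    grads = grad_log_p(particles)                         # shape (n, d)
    # Attraction: kernel-weighted average of the score at every particle.
    attraction = kernel @ grads
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = sum_j k_ij (x_i - x_j) / h^2,
    # which keeps particles apart so they can cover several modes.
    repulsion = np.sum(kernel[:, :, None] * diff, axis=1) / bandwidth ** 2
    return particles + step_size * (attraction + repulsion) / n


# Toy usage (assumed target): standard normal, whose score is -x.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(20, 2))
for _ in range(1000):
    particles = svgd_step(particles, lambda x: -x)
```

The repulsion term is what distinguishes this from running gradient ascent on each particle independently: without it, all particles would collapse onto a single mode, which is the overfitting behavior the abstract attributes to MLE training.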


Paper Structure (34 sections, 1 theorem, 25 equations, 4 figures, 21 tables, 2 algorithms)


Introduction
Preliminary
Formulation of Prompt Learning
Deterministic Prompt Learning
Probabilistic Prompt Learning
Bayesian Probabilistic Prompt Learning
Wasserstein Gradient Flow
Data-Dependent Prior
Method
Formulation of Prompt Posterior Distribution
Bayesian Adaptation of Prompt to Test data
Variational Inference for Prompt Posterior
Parameter Training of Data-Dependent Prior
Adaptation $\theta$ with Test Data-Dependent Prior
Results
...and 19 more sections

Key Result

Proposition 1

Suppose that the Markov chain assumption holds as $f(X) \rightarrow \phi(f(X)) \rightarrow g(\phi(f(X)), \cdot)$, then the lower bound of the mutual information, $I(f(X);\phi(f(X)))$, is derived as follows: $I(f(X);\phi(f(X))) \ge I(f(X);g(\phi(f(X)), \cdot)) \ge \log C - \mathcal{L}_{CE} (\phi(f(X)
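As a rough numerical illustration of the final term in the bound, the sketch below evaluates a quantity of the form $\log C - \mathcal{L}_{CE}$ for a batch of classifier logits. The arguments of $\mathcal{L}_{CE}$ are truncated in the statement above, so this is not a reconstruction of the paper's loss. It assumes that $C$ is the number of classes and that $\mathcal{L}_{CE}$ is an ordinary mean cross-entropy over class logits; the function names are hypothetical.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def mi_lower_bound(logits, labels):
    """Evaluate log C - L_CE, assuming C is the number of classes."""
    num_classes = logits.shape[1]
    return np.log(num_classes) - cross_entropy(logits, labels)


# Toy usage: uninformative (all-zero) logits over three classes.
labels = np.array([0, 1, 2, 0])
uniform_bound = mi_lower_bound(np.zeros((4, 3)), labels)
```

Under these assumptions, uninformative logits give a cross-entropy of $\log C$ and therefore a bound of zero, while lowering the cross-entropy raises the bound toward $\log C$. This is why minimizing a cross-entropy term can serve as a surrogate for maximizing the mutual information.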

Figures (4)

Figure 1: Structure (left) and learning dynamics (right) of APP. Multiple context vectors act as particles of the approximated distribution, and the image-conditioned prior guides the context vectors to capture multiple modes.
Figure 2: Few-shot classification results. Each experiment is replicated three times.
Figure 3: UMAP visualization of image features and text features for EuroSAT. (Upper) Histograms correspond to image features, and $\star$ marks the text features of all classes. (Lower) Image and text features of two arbitrary classes. Colors correspond to classes.
Figure 4: Sensitivity analysis on $\alpha$ and the effect of the test data-dependent prior. Experiments are replicated three times.

Theorems & Definitions (2)

Definition 1
Proposition 1
