Prompt Diffusion Robustifies Any-Modality Prompt Learning

Yingjun Du, Gaowen Liu, Yuzhang Shang, Yuguang Yao, Ramana Kompella, Cees G. M. Snoek

Abstract

Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain overfitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained prompt diffusion model. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module that can be seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.
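
The test-time procedure described in the abstract (start from a random prompt, refine it over five steps conditioned on the test image) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `make_toy_denoiser` is a hypothetical stand-in for the trained diffusion transformer, and the linear noise path and the deterministic update are a generic ODE-style sampler, not necessarily the solver or schedule used by the authors.

```python
import numpy as np

def make_toy_denoiser(dim, seed=0):
    """Hypothetical stand-in for the trained diffusion transformer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(dim, dim))

    def denoise(v_t, image_feat, t):
        # Toy prediction of the clean prompt from image features only.
        # A real model would also make use of v_t and t.
        return np.tanh(image_feat @ W)

    return denoise

def sample_prompt(denoise, image_feat, num_steps=5, seed=0):
    """Deterministic few-step sampling from a random prompt (assumed schedule)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=image_feat.shape)        # random prompt at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)    # t goes from 1 (noise) to 0 (clean)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v0_hat = denoise(v, image_feat, t_cur)   # predicted clean prompt
        # Assumed linear path: v_t = (1 - t) * v0 + t * noise.
        eps_hat = (v - (1.0 - t_cur) * v0_hat) / max(t_cur, 1e-8)
        v = (1.0 - t_next) * v0_hat + t_next * eps_hat
    return v

# Usage: one customized prompt per test image, no label required.
image_feat = np.linspace(-1.0, 1.0, 8)           # placeholder image features
prompt = sample_prompt(make_toy_denoiser(8), image_feat, num_steps=5)
```

Because the update is deterministic given the initial noise, the number of steps directly controls the number of model evaluations, which is the trade-off between accuracy and cost that the abstract refers to.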

Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Prompt diffusion enhances traditional prompt learning methods such as CoCoOp by introducing a diffusion process within the prompt space (colored arrows). Unlike deterministic prompt learning methods (black arrows), we employ a diffusion transformer to refine the prompts gradually. This process creates tailored prompts for each sample, complementing and augmenting existing prompting methods to achieve higher prediction accuracy through stronger generalization.
  • Figure 2: Per-sample prompt overfitting for textual prompt learning. With a small number of gradient descent iterations $I$, we derive overfitted prompts for each sample in the dataset. These overfitted prompts act as a "ground truth" for the prompts of each sample, enabling our proposed diffusion transformer to learn the transition from generic prompts to highly personalized overfitted prompts.
  • Figure 3: Prompt diffusion. (1) During training, overfitted prompts are first obtained through per-sample overfitting. Noise is then injected into these prompts before they enter the forward diffusion process. The inputs for diffusion include the noisy prompts $\tilde{\bm{V}}_t^*$, the image features $\pi$, and a randomly chosen time step $t$, which leads to the generation of diffused prompts $\tilde{\bm{V}}_t$. After training, the diffusion transformer can convert generic prompts into their overfitted counterparts for each sample. (2) During testing, the sampling process begins with an initial random noise $\tilde{\bm{V}}_{T}$, which is gradually refined into diffused prompts $\tilde{\bm{V}}_{t}$. At each time step $t$, the sampling process incorporates the previous state $\tilde{\bm{V}}_{t-1}$, test image features $\pi$, and current time step $t$ as inputs. The resulting diffused prompts $\tilde{\bm{V}}_0$ are then employed to make test sample predictions. Over the $T$ time steps, our diffusion transformer adapts the vanilla prompts into customized prompts that contain more specific information about the test sample, thereby enhancing prediction accuracy.
  • Figure 4: Effect of the number of function evaluations on base-to-new generalization.
  • Figure 5: Impact of iterations on per-sample prompt overfitting for novel classes.
  • ...and 2 more figures
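
The per-sample overfitting step described in the Figure 2 caption can be sketched as a few gradient descent iterations on the prompt for a single labeled sample. This is only a toy under stated assumptions: the function names, the way the prompt modulates the class embeddings (`class_embeds * (1 + prompt)`), and the use of unit-norm features are all hypothetical choices made so the example is self-contained, and they do not reproduce the text encoder used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_loss(prompt, class_embeds, image_feat, label):
    """Cross-entropy of one sample under the toy prompt-conditioned classifier."""
    logits = (class_embeds * (1.0 + prompt)) @ image_feat
    return -np.log(softmax(logits)[label])

def overfit_prompt(prompt, class_embeds, image_feat, label, iters=10, lr=0.5):
    """Run `iters` gradient descent steps on the prompt for a single sample."""
    p = prompt.copy()                    # leave the generic prompt untouched
    for _ in range(iters):
        logits = (class_embeds * (1.0 + p)) @ image_feat
        grad_logits = softmax(logits)
        grad_logits[label] -= 1.0        # d(loss)/d(logits) for cross-entropy
        grad = (grad_logits @ class_embeds) * image_feat
        p -= lr * grad
    return p

# Usage with placeholder, unit-norm features (assumed, as with CLIP-style encoders).
rng = np.random.default_rng(0)
class_embeds = rng.normal(size=(3, 8))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)
image_feat = rng.normal(size=8)
image_feat /= np.linalg.norm(image_feat)
generic_prompt = np.zeros(8)
overfitted = overfit_prompt(generic_prompt, class_embeds, image_feat, label=1)
```

In the pipeline the captions describe, prompts produced this way would serve as per-sample training targets for the diffusion transformer; the label is only needed at this stage, not at test time.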