Differentiable Prompt Learning for Vision Language Models

Zhenhan Huang; Tejaswini Pedapati; Pin-Yu Chen; Jianxi Gao

TL;DR

This work tackles the challenge of designing effective deep continuous prompts for vision-language models by automatically determining per-layer prompt context lengths and depths. It introduces Differentiable Prompt Learning (DPL), a bilevel optimization framework that uses differentiable relaxations to search over per-layer prompt configurations, followed by a training stage that fine-tunes the final subprompts with optional knowledge distillation. The method employs cross-attention to fuse multiple prompt options at each layer and reports a systematic improvement of about $2.60\%$ average accuracy on 11 datasets with a ViT-B/16 CLIP backbone in few-shot settings, highlighting dataset-dependent, asymmetric prompt configurations. While the searching stage is computationally intensive due to the combinatorial space of prompt configurations, the approach remains compatible with existing prompt-learning designs and offers a practical, scalable path to tailoring prompts for diverse downstream tasks. Overall, DPL demonstrates that automatic, layer-wise customization of continuous prompts can surpass manually designed configurations and adapt to distribution shifts across datasets.

Abstract

Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundational models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only in the input but also in the intermediate hidden representations. Manually designed deep continuous prompts exhibit a remarkable improvement compared to the zero-shot pre-trained model on downstream tasks. How to automate continuous prompt design is an underexplored area, and a fundamental question arises: is the manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). The DPL method is formulated as an optimization problem that automatically determines the optimal context length of the prompt to be added to each layer, where the objective is to maximize performance. We test the DPL method on the pre-trained CLIP. We empirically find that, using only limited data, our DPL method can find a deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. Besides, our method focuses only on the prompt configuration (i.e., the context length for each layer), which means that it is compatible with baseline methods that have sophisticated designs to boost performance. The DPL method can be deployed to large language models or computer vision models at no cost.
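The per-layer choice of context length described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a DARTS-style rule in which each layer keeps the option with the largest softmax weight, and the candidate lengths, the `select_context_lengths` helper, and the example scores are all hypothetical. The matrix layout (rows for context-length options, columns for layers) follows the description of the $\alpha$ matrices in the Figure 2 caption.

```python
import math

# Hypothetical candidate context lengths, one per row of the alpha matrix.
# A length of 0 stands for "add no prompt at this layer" (an assumption).
CONTEXT_LENGTHS = [0, 2, 4, 8]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_context_lengths(alpha, context_lengths=CONTEXT_LENGTHS):
    """Pick one context length per layer from an alpha matrix.

    alpha[i][l] is the learned score of option i at layer l
    (rows: context-length options, columns: transformer blocks).
    Assumed rule: keep the option with the largest softmax weight.
    """
    num_layers = len(alpha[0])
    selected = []
    for layer in range(num_layers):
        column = [row[layer] for row in alpha]
        weights = softmax(column)
        best = max(range(len(weights)), key=weights.__getitem__)
        selected.append(context_lengths[best])
    return selected


# Made-up scores for a 3-layer text branch, for illustration only.
alpha_text = [
    [0.1, 1.5, 0.0],
    [2.0, 0.2, 0.1],
    [0.3, 0.1, 0.2],
    [0.0, 0.4, 1.9],
]
print(select_context_lengths(alpha_text))  # [2, 0, 8]
```

In this toy example the three layers end up with different context lengths, which is the kind of asymmetric, layer-wise configuration the TL;DR refers to; a separate matrix would be handled the same way for the image branch.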

Paper Structure (25 sections, 15 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Background and Related Work
Prompt Learning
Neural Architecture Search
Revisiting Differentiable NAS
Differentiable Prompt Learning
Searching Stage
Training Stage
Experiments
Datasets and Experiment Setup
Datasets
Baselines
Experiment Details
Determining Context Lengths of Continuous Prompts
Prompt Learning Based on Alpha Matrix
...and 10 more sections

Figures (7)

Figure 1: (a) In the searching stage, continuous prompts with different context lengths (shown in red color) are added to the original context tokens (shown in blue color) as the input to transformer blocks in the text branch or image branch. The outputs of transformer blocks are used as the original context tokens for the next transformer blocks. $\{\alpha^{(l)}_i\}$ are differentiable parameters to control the contribution of different prompt options. (b) After the searching stage, two $\alpha$ matrices are obtained to indicate the selection of the search algorithm for differentiable context tokens in the language branch and the image branch. (c) In the training stage, prompt learning is conducted using the differentiable context token setting determined by the search algorithm.
Figure 2: (a) $\alpha$ matrices for the text branch. (b) $\alpha$ matrices for the image branch. The $\alpha$ matrices are obtained at epoch 60. The row dimension corresponds to the context length of the added continuous prompts. The column dimension corresponds to model depth, i.e., the number of transformer blocks. (c) The evolution of the $\alpha$ difference and the number of dominants. As the number of training epochs increases, the $\alpha$ matrix gradually converges and the number of dominants increases to its converged value.
Figure 3: Original image and Grad-CAM visualization (Selvaraju et al., 2017) for various methods on the FGVCAircraft, StanfordCars and UCF101 datasets. Our method helps the pre-trained model focus on key elements in the foreground and avoid distraction from the background. The text template "a photo of a [class]" is used in the Grad-CAM calculation.
Figure 4: Evolution of $\alpha$ matrices using various datasets in the searching stage. Although the $\alpha$ matrices have the same random initialization for the same text branch or image branch, the converged matrices differ across datasets. This indicates that the best prompt configuration depends on the distribution shift, whereas existing manually designed prompt learning methods use the same context length for different downstream datasets.
Figure 5: The computational complexity of the optimal prompt configuration determined by the DPL method.
...and 2 more figures

Theorems & Definitions (1)

Definition 4.1: single-dominant
