Modular Prompt Learning Improves Vision-Language Models

Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao

TL;DR

This work addresses the challenge of preserving information in deep continuous prompts used by vision-language models during transfer to new tasks. It introduces Modular Prompt Learning (MPL), which uses three operations (adding, removing, and carrying prompts) to vary the number of prompts across transformer layers while coupling language and vision prompts. With the backbone frozen and only the modular prompts learned, MPL improves base-to-new generalization (average improvement ≈ 0.7%, up to 10.7% on EuroSAT) and is competitive on cross-dataset transfer (average ≈ 66.25%, with a 5.6% improvement on EuroSAT), while remaining efficient. The results suggest that dynamic, layer-wise prompt management can outperform fixed prompt strategies when adapting pre-trained VLMs to diverse visual tasks.

Abstract

Pre-trained vision-language models are able to interpret visual concepts and language semantics. Prompt learning, a method of constructing prompts for text encoders or image encoders, elicits the potential of pre-trained models and readily adapts them to new scenarios. Compared to fine-tuning, prompt learning enables the model to achieve comparable or better performance with fewer trainable parameters. In addition, prompt learning freezes the pre-trained model and thereby avoids the catastrophic forgetting that can occur during fine-tuning. Continuous prompts inserted into the input of every transformer layer (i.e., deep prompts) can improve the performance of pre-trained models on downstream tasks. For the $i$-th transformer layer, the inserted prompts replace the prompts previously inserted in the $(i-1)$-th layer. Although the self-attention mechanism contextualizes the newly inserted prompts for the current layer together with the embeddings from the previous layer's output, removing all inserted prompts from the previous layer inevitably loses information contained in the continuous prompts. In this work, we propose Modular Prompt Learning (MPL), which is designed to promote the preservation of information contained in the inserted prompts. We evaluate the proposed method on base-to-new generalization and cross-dataset tasks. Averaged over 11 datasets, our method achieves a 0.7% performance gain on the base-to-new generalization task compared to the state-of-the-art method. The largest improvement on an individual dataset is 10.7% (EuroSAT dataset).
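The replacement behavior of conventional deep prompts described in the abstract can be illustrated with a small bookkeeping sketch. It tracks only which tokens enter each layer and performs no real attention. The function names, the string token labels, and the `layer_stub` placeholder are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def layer_stub(tokens: List[str], depth: int) -> List[str]:
    """Hypothetical stand-in for a frozen transformer layer.

    It only tags each token with the layer index so the flow is visible.
    """
    return [f"{t}@{depth}" for t in tokens]


def deep_prompt_forward(
    embeddings: List[str], num_layers: int, num_prompts: int
) -> List[List[str]]:
    """Return the input sequence seen by each layer under replace-style deep prompting."""
    inputs_per_layer = []
    hidden = list(embeddings)
    for i in range(1, num_layers + 1):
        # Fresh learnable prompts for layer i (labels are illustrative).
        prompts = [f"p{i}_{k}" for k in range(num_prompts)]
        layer_input = prompts + hidden
        inputs_per_layer.append(layer_input)
        out = layer_stub(layer_input, i)
        # Drop the prompt positions: the next layer receives new prompts instead,
        # so the outputs at the prompt positions of layer i are discarded.
        hidden = out[num_prompts:]
    return inputs_per_layer


inputs = deep_prompt_forward(["x0", "x1"], num_layers=3, num_prompts=2)
for depth, seq in enumerate(inputs, start=1):
    print(depth, seq)
```

In this sketch every layer sees the same number of prompts, and nothing at a prompt position survives past the layer it was inserted into, which is the information loss the abstract refers to.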

Paper Structure

This paper contains 18 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of the proposed framework. We use deep prompts with a coupling function $\mathcal{F}$. The coupling function bridges the prompts for the language branch to those for the visual branch. Three operations, $\mathcal{O}_{\rm add}$, $\mathcal{O}_{\rm rm}$ and $\mathcal{O}_{\rm cr}$, are applied to allow a varying number of continuous prompts to be inserted into the transformer layers. $\mathcal{O}_{\rm add}$ is applied to the inputs of transformer layers, while $\mathcal{O}_{\rm rm}$ and $\mathcal{O}_{\rm cr}$ are applied to the outputs of transformer layers.
  • Figure 2: Performance on base-to-new generalization (left) and cross-dataset evaluation (right) with respect to the average training time. For the base-to-new generalization task, the running time is averaged over 10 datasets. For the cross-dataset evaluation task, the running time is the time for training the model on the ImageNet dataset.
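One possible reading of the three operations in Figure 1 is sketched below as token bookkeeping. This is an assumption-laden illustration, not the paper's implementation: the caption only states where each operation is applied, so the selection rule used here (carry the most recently added prompt outputs, remove the rest), the per-layer `schedule` argument, and all function names are hypothetical. The coupling function $\mathcal{F}$ is omitted.

```python
from typing import List, Tuple


def op_add(carried: List[str], layer: int, n_add: int) -> List[str]:
    """Assumed O_add: append n_add new prompts to those entering `layer`."""
    return carried + [f"p{layer}_{k}" for k in range(n_add)]


def split_output(prompt_outputs: List[str], n_carry: int) -> Tuple[List[str], List[str]]:
    """Assumed O_cr / O_rm applied to a layer's output.

    Carries the last n_carry prompt outputs forward and removes the others.
    Which prompts are carried is an illustrative choice.
    """
    k = min(max(n_carry, 0), len(prompt_outputs))
    cut = len(prompt_outputs) - k
    carried = prompt_outputs[cut:]
    removed = prompt_outputs[:cut]
    return carried, removed


def modular_forward(schedule: List[Tuple[int, int]]) -> List[List[str]]:
    """Return the prompts present at the input of each layer.

    schedule[i] = (n_add, n_carry) for layer i + 1 (hypothetical format).
    """
    carried: List[str] = []
    prompts_per_layer = []
    for layer, (n_add, n_carry) in enumerate(schedule, start=1):
        prompts = op_add(carried, layer, n_add)
        prompts_per_layer.append(prompts)
        # A frozen transformer layer would process prompts + embeddings here.
        carried, _removed = split_output(prompts, n_carry)
    return prompts_per_layer


layers = modular_forward([(2, 1), (2, 2), (1, 0)])
print([len(p) for p in layers])
print(layers)
```

With this schedule the number of prompts differs from layer to layer, and some prompts from an earlier layer remain present in the next one rather than being replaced wholesale.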