Prompt Tuning with Soft Context Sharing for Vision-Language Models

Kun Ding, Ying Wang, Pengzhang Liu, Qiang Yu, Haojian Zhang, Shiming Xiang, Chunhong Pan

TL;DR

This work addresses adapting vision-language models to multiple few-shot tasks by exploiting inter-task relationships. It introduces SoftCPT, a soft context sharing prompt-tuning approach that uses a shared meta network to generate per-task prompts from task descriptions, trained jointly across all tasks. Empirical results across four multi-task datasets with 44 tasks and 1593 categories show SoftCPT consistently outperforming single-task prompt tuning and hard-sharing baselines, with notable gains in specialized domains and stable performance across backbones. The approach demonstrates the practical value of multi-task prompt learning for vision-language models and provides insights into task relatedness modeling via pre-trained language guidance.

Abstract

Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates that prompt tuning designed for vision-language models can achieve better performance on few-shot image recognition than linear probing, a strong baseline. In practice, many few-shot tasks are inherently correlated, particularly within specialized domains. However, this correlation has been overlooked by previous work. Inspired by the fact that modeling task relationships through multi-task learning can usually boost performance, we propose a novel method, SoftCPT (Soft Context Sharing for Prompt Tuning), to tune pre-trained vision-language models on multiple target few-shot tasks jointly. Specifically, we design a task-shared meta network that generates the prompt context for each task, using the task name together with a learnable task context as input. The parameters of this meta network, as well as the task context, are tuned on the joint training set of all tasks. As a result, the prompt contexts of all tasks are shared in a soft manner. Extensive experiments across four multi-task few-shot datasets covering 44 tasks and 1593 categories demonstrate that SoftCPT significantly outperforms single-task prompt tuning methods, highlighting the effectiveness of multi-task learning for vision-language prompt tuning. Code is available at https://github.com/kding1225/softcpt.
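
The mechanism described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation (the linked repository is the reference for that). The mean-pooling `frozen_text_encoder`, the single linear layer in `meta_network`, the dimensions, the temperature and the random toy data are all assumptions made for the example. It shows one shared set of parameters (`W`, `b`, `task_ctx`) producing a different prompt context for each task, with the per-task losses summed into a single objective.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 4  # embedding dim and context length (illustrative values only)

def frozen_text_encoder(tokens):
    """Stand-in for a frozen text encoder: mean-pool tokens, then L2-normalise."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def meta_network(task_tokens, task_ctx, W, b):
    """Shared meta network: (learnable task context + task name) -> L context vectors."""
    g = frozen_text_encoder(np.vstack([task_ctx, task_tokens]))  # task feature, (D,)
    return (W @ g + b).reshape(L, D)                             # per-task context, (L, D)

def task_loss(context, class_tokens, image_feats, labels, temperature=0.07):
    """Cross-entropy between image features and prompt-derived class features."""
    class_feats = np.stack([
        frozen_text_encoder(np.vstack([context, tok])) for tok in class_tokens
    ])                                                           # (num_classes, D)
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = image_feats @ class_feats.T / temperature
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Parameters shared by every task (the only trainable quantities in this sketch).
W = rng.normal(scale=0.02, size=(L * D, D))
b = np.zeros(L * D)
task_ctx = rng.normal(scale=0.02, size=(2, D))

# Two toy tasks with random stand-ins for token embeddings and image features.
tasks = []
for num_classes in (2, 3):
    tasks.append({
        "task_tokens": rng.normal(size=(3, D)),
        "class_tokens": [rng.normal(size=(2, D)) for _ in range(num_classes)],
        "image_feats": rng.normal(size=(5, D)),
        "labels": rng.integers(0, num_classes, size=5),
    })

# Loss is computed per task, then summed across tasks for joint training.
total_loss = sum(
    task_loss(meta_network(t["task_tokens"], task_ctx, W, b),
              t["class_tokens"], t["image_feats"], t["labels"])
    for t in tasks
)
```

Because `W`, `b` and `task_ctx` are shared, a gradient step on `total_loss` moves every task's context at once, while different task names still yield different contexts. That is the sense in which the sharing is soft rather than hard.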

Paper Structure

This paper contains 21 sections, 1 theorem, 16 equations, 9 figures, 10 tables.

Key Result

Proposition 1

For the task-specified case with a linear sub-network, the new context for class names after one SGD update step can be represented as where $\boldsymbol{S}'_t$, $\boldsymbol{S}_t$, $\eta$, $\boldsymbol{d}_{t}$, $\boldsymbol{g}_t$ and $\boldsymbol{C}_t$ are the new context of task $t$, the old context of task $t$, the learning rate, the gradient of the loss of task $t$ with respect to $\boldsymbol{S}_t$, task fea
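
To illustrate the kind of cross-task coupling a shared linear sub-network induces, the following sketch checks a simple identity numerically. It is not the paper's exact statement: it assumes a bias-free linear map $\boldsymbol{S}_t = \boldsymbol{W}\boldsymbol{g}_t$ with flattened contexts, a loss summed over tasks, and random stand-ins for the gradients $\boldsymbol{d}_t$. The function name `one_sgd_step` and all sizes are hypothetical. Under those assumptions, one SGD step on $\boldsymbol{W}$ changes each context by $-\eta \sum_i (\boldsymbol{g}_i^\top \boldsymbol{g}_t)\,\boldsymbol{d}_i$, so tasks with similar features pull each other's contexts in similar directions.

```python
import numpy as np

def one_sgd_step(G, W, D, eta):
    """Compare contexts after one SGD step on W with the closed-form expression.

    G:   (T, F) task features g_t (frozen)
    W:   (C, F) shared linear sub-network, assumed S_t = W g_t with no bias
    D:   (T, C) gradients d_t = dL_t/dS_t, evaluated at the old contexts
    eta: learning rate
    """
    S_old = G @ W.T                      # old contexts, one row per task
    grad_W = D.T @ G                     # d(sum_t L_t)/dW = sum_t outer(d_t, g_t)
    S_new = G @ (W - eta * grad_W).T     # contexts produced by the updated W
    # Closed form: S'_t = S_t - eta * sum_i (g_i . g_t) d_i
    S_closed = S_old - eta * (G @ G.T) @ D
    return S_new, S_closed

rng = np.random.default_rng(0)
T, F, C = 3, 4, 5                        # tasks, feature dim, flattened context dim
G = rng.normal(size=(T, F))
W = rng.normal(size=(C, F))
D = rng.normal(size=(T, C))

S_new, S_closed = one_sgd_step(G, W, D, eta=0.1)
assert np.allclose(S_new, S_closed)
```

If the task features were mutually orthogonal, the off-diagonal terms of `G @ G.T` would vanish and each task would be updated only by its own gradient, which recovers independent single-task tuning as a special case of this toy setting.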

Figures (9)

  • Figure 1: A conceptual comparison of different prompt tuning methods. (a) class-agnostic CoOp, (b) class-specified CoOp, (c) hard prompt sharing for CoOp, (d) our soft prompt sharing, (e) average performances on four datasets. In (a)-(d), we assume there are 2 classes per task and shared prompt contexts are in the same color.
  • Figure 2: Illustration of the proposed multi-task prompt tuning method SoftCPT. Unlike CoOp, a meta network is introduced to produce learnable context ($[S]_1^t[S]_2^t\cdots[S]_L^t$) for each task. The meta network consists of a frozen text encoder and a learnable sub-network. The text encoder extracts task features from task names, while the sub-network transforms the task features to learnable context of class names. For model training, SoftCPT uses samples from all tasks and the loss is first computed independently for each task. The summed loss of all tasks is then used for backpropagation (see the total-loss equation). In the figure, [TASK] denotes token embeddings of task name, [CLASS] denotes token embeddings of a certain class name in a task.
  • Figure 3: Some example images from the Fashion-20 dataset. Note that only 8 out of the 20 tasks are being displayed.
  • Figure 4: Illustration of adding class features to task features. The context $[V]_1[V]_2\cdots[V]_K$ of length $K$ is learned. For the $t$-th task, there are $C_t$ classes.
  • Figure 5: (a) Results on Plant-6 with different backbones, (b) Relative Standard Deviation (RSD) on three datasets, (c) impact of varying prompt length on Plant-6, (d)-(f) comparison between task features learned with text encoder ('TE') and without text encoder from scratch ('Rand') on General-10, Plant-6 and Fashion-20, respectively. 'feat_dim' is the task feature's dimension.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof