SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Yang Zhou; Yongjian Wu; Jiya Saiyin; Bingzheng Wei; Maode Lai; Eric Chang; Yan Xu

TL;DR

SDPT (Synchronous Dual Prompt Tuning) adapts fusion-based visual-language pre-trained models by placing a single set of unified prototype tokens inside the cross-attention fusion space and deriving inverse linear projections from the pre-trained query transforms. This preserves the pre-trained text-image alignment knowledge and removes the need to train extra modal mappings, while training only 0.04% of the model parameters. Across COCO, LVIS, and ODinW13, SDPT outperforms single- and dual-modal PEFT methods as well as full fine-tuning, and remains robust in few-shot and self-training scenarios. The approach is compatible with existing PEFT modules and supports broader tasks, which suggests practical value for efficient adaptation of fusion-based VLPMs.

Abstract

Prompt tuning methods have achieved remarkable success in parameter-efficient fine-tuning of large pre-trained models. However, their application to dual-modal fusion-based visual-language pre-trained models (VLPMs), such as GLIP, has encountered issues. Existing prompt tuning methods do not effectively address the modal mapping and alignment problem for tokens in different modalities, which leads to poor transfer generalization. To address this issue, we propose Synchronous Dual Prompt Tuning (SDPT). SDPT initializes a single set of learnable unified prototype tokens in the established modal aligning space to represent the aligned semantics of the text and image modalities for downstream tasks. Furthermore, SDPT establishes inverse linear projections, which require no training, to embed the information of the unified prototype tokens into the input space of each modality. The inverse linear projections allow the unified prototype tokens to represent the two modalities synchronously and enable SDPT to share the unified semantics of text and image for downstream tasks across the different modal prompts. Experimental results demonstrate that SDPT helps fusion-based VLPMs achieve superior results across various scenarios while training only 0.04% of the model parameters, outperforming other single- and dual-modal methods. The code will be released at https://github.com/wuyongjianCODE/SDPT.
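The idea of a training-free inverse linear projection can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes each modality enters the fusion space through a frozen affine query transform `x @ W + b`, and it uses the Moore-Penrose pseudo-inverse as one plausible way to map the shared tokens back into each modality's input space. The names `inverse_linear_projection`, `W_t`, `W_i`, and all dimensions are hypothetical.

```python
import numpy as np


def inverse_linear_projection(W, b):
    """Build a training-free map from the fusion space back to an input space.

    Assumes the frozen forward transform is x @ W + b, with W of shape
    (d_in, d_fuse). The pseudo-inverse is an illustrative choice.
    """
    W_pinv = np.linalg.pinv(W)  # shape (d_fuse, d_in)

    def project_back(z):
        # Undo the bias, then undo the linear map.
        return (z - b) @ W_pinv

    return project_back


rng = np.random.default_rng(0)
d_text, d_img, d_fuse, k = 16, 24, 8, 10  # hypothetical sizes

# Stand-ins for frozen, pre-trained query transforms of the two modalities.
W_t, b_t = rng.normal(size=(d_text, d_fuse)), rng.normal(size=d_fuse)
W_i, b_i = rng.normal(size=(d_img, d_fuse)), rng.normal(size=d_fuse)

# The only trainable parameters in this sketch: k unified prototype tokens.
Z = rng.normal(size=(k, d_fuse))

# One shared set of tokens yields prompts for both modalities.
text_prompts = inverse_linear_projection(W_t, b_t)(Z)   # shape (k, d_text)
image_prompts = inverse_linear_projection(W_i, b_i)(Z)  # shape (k, d_img)
```

In this sketch both prompt sets land on the same tokens `Z` once the frozen transforms are applied, so the two modalities share one set of semantics without any additional learned mapping.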

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, and 20 tables.

Introduction
Related Works
Method
Preliminary
Synchronous Dual Prompt Tuning
Unified Prototype Token
Inverse Linear Projections
Experiments
Downstream Tasks
Comparison methods
Implementation
Main Comparison Results
Generality and Flexibility
Ablation Studies
Conclusion
...and 10 more sections

Figures (6)

Figure 1: Synchronous Dual Prompt Tuning (SDPT) vs. other dual-modal PEFT methods on fusion-based VLPMs. (a) Existing dual-modal PEFT methods (left) require learning new modal mapping substructures or modality aligning spaces, whereas SDPT (right) does not and thus achieves better PEFT performance for fusion-based VLPMs on new tasks. (b) Performance of different methods on the 13 downstream tasks of ODinW13 (Li et al., 2022) for GLIP-L, with mean and standard deviation annotated. SDPT (k=10) outperforms full fine-tuning while training only 0.04% of all model parameters.
Figure 2: Detailed illustration of Synchronous Dual Prompt Tuning (SDPT). X-MHA refers to the cross-attention layer. Unified prototype tokens $Z^i$ in each X-MHA layer are tuned through inverse linear projections to synchronously incorporate dual-modality knowledge for the new task while keeping the other parameters of the network frozen.
Figure 3: Comparison of attention map visualizations of the different methods on LVIS (rows 1–6) and PascalVOC (rows 7–12). (a) Original image and ground truths, (b) LoRA, (c) BitFit, (d) DPT, (e) Apollo, (f) PMF, (g) UPT, (h) MaPLe, (i) SDPT (k=10), (j) SDPT (k=120 for LVIS, 200 for PascalVOC). Ground truths are marked by red boxes.
Figure 4: The optimization curve for hyperparameters, compared with straightforward combinations of single-modal methods. The blue dashed line is the optimal score of "CoOp-VPT" and "VPT-CoOp", while the green dotted line is the optimal score of "Adapter-Adaptformer" and "Adaptformer-Adapter".
Figure 5: Comparison of attention map visualization of the different methods on COCO. (a) Original image and ground truths, (b) LoRA, (c) BitFit, (d) DPT, (e) Apollo, (f) PMF, (g) UPT, (h) MaPLe, (i) SDPT (k=10), (j) SDPT (k=120). Ground truths are marked by red boxes.
...and 1 more figures
