DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification

Minghui Lin, Shu Wang, Xiang Wang, Jianhua Tang, Longbin Fu, Zhengrong Zuo, Nong Sang

TL;DR

DMPT tackles the high computational cost of multi-modal ReID by freezing the backbone and optimizing a compact set of decoupled modality-aware prompts. It combines modality-specific and modality-independent semantic prompts with a PromptIBind cross-modal interaction to exchange complementary information without corrupting modality-specific features. The approach yields competitive results on four benchmarks while requiring only a small fraction of tunable parameters, demonstrating strong efficiency and scalability. This work advances parameter-efficient fine-tuning in multi-modal perception by explicitly decoupling modalities and fostering cross-modal semantics through a novel bind-based mechanism.

Abstract

Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (i.e., ViT) have made remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which optimizes a large number of backbone parameters and therefore incurs extensive computational and storage costs. In this work, we propose DMPT, an efficient prompt-tuning framework tailored for multi-modal object re-identification, which freezes the main backbone and optimizes only a small set of newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts, which leverage prior modality knowledge from a powerful text encoder, and modality-independent semantic prompts, which extract semantic information from multi-modal inputs such as visible, near-infrared, and thermal-infrared images. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitate the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that DMPT achieves results competitive with existing state-of-the-art methods while requiring fine-tuning of only 6.5% of the backbone parameters.
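To make the parameter-efficiency argument concrete, the following minimal sketch compares the size of a set of learnable prompt tokens with the size of a frozen backbone. Every number in it is an assumption chosen for illustration: the backbone size is taken as roughly that of a ViT-B/16 (about 86M parameters), and the prompt length, layer count, and the helper names `prompt_param_count` and `trainable_fraction` are hypothetical rather than taken from the paper. The 6.5% figure in the abstract covers all newly added parameters, so this toy count of prompt tokens alone is not expected to reproduce it.

```python
# Sketch: trainable-parameter budget of prompt tuning versus a frozen backbone.
# All sizes below are hypothetical and are NOT the paper's configuration.


def prompt_param_count(num_layers, prompt_len, embed_dim, num_modalities):
    """Count parameters in learnable prompt tokens, assuming one set of
    prompts per transformer layer and per modality."""
    return num_layers * prompt_len * embed_dim * num_modalities


def trainable_fraction(trainable, backbone):
    """Express the trainable parameters as a fraction of the frozen backbone."""
    return trainable / backbone


# Assumed backbone size: roughly a ViT-B/16 (about 86M parameters).
BACKBONE_PARAMS = 86_000_000

# Hypothetical prompt configuration for three modalities (RGB, NIR, TIR).
prompts = prompt_param_count(
    num_layers=12, prompt_len=4, embed_dim=768, num_modalities=3
)
fraction = trainable_fraction(prompts, BACKBONE_PARAMS)

print(f"prompt parameters: {prompts:,}")
print(f"fraction of backbone: {fraction:.4%}")
```

Under these assumed sizes the prompt tokens alone amount to well under 1% of the backbone, which illustrates why freezing the backbone reduces both optimizer state and per-task storage.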

Paper Structure

This paper contains 17 sections, 17 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a, c) Comparison of DMPT with traditional multi-modal object re-identification methods. Existing methods use a full fine-tuning architecture for direct feature fusion. In contrast, DMPT employs decoupled interaction based on prompt engineering. (b, c) Comparison of existing bind methods with our PromptIBind-based interaction method. For example, LanguageBind (Zhu et al., 2023) uses language as a medium for multi-modal alignment. In contrast, we introduce bind prompts for cross-modal inverse interaction.
  • Figure 2: Overview of our proposed Decoupled Modality-aware Prompt Tuning (DMPT) framework. We introduce text prompts, modality prompts, semantic prompts, and bind prompts for prompt tuning while keeping the backbone model parameters frozen. Text prompts are aligned with image features to enhance image representation. Visual prompts are decoupled into modality prompts and semantic prompts to further capture cross-modal semantic information. The PromptIBind-based interaction layer exchanges complementary cross-modal information synchronously through a "one-to-many" inverse bind structure.
  • Figure 3: Illustration of the bind prompt to modality prompt cross-attention interaction module.
  • Figure 4: Ablation of the semantic prompt length. We select the best semantic prompt length for our DMPT.
  • Figure 5: t-SNE visualization of features. Different colors represent different identities. (a) $\mathcal{S}_p$ (baseline$_1$); (b) $\mathcal{S}_p + \mathcal{S}_p^{bind}$; (c) $\mathcal{S}_p + \mathcal{M}_p + \mathcal{S}_p^{bind}$; (d) $\mathcal{S}_p + \mathcal{M}_p + \mathcal{S}_p^{bind} + W_t$.
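
The captions for Figures 2 and 3 describe bind prompts that connect the prompt tokens of different modalities through cross-attention in a "one-to-many" structure. The sketch below shows one possible reading of that idea in plain Python; it is an assumption-laden illustration, not the authors' module. It assumes single-head scaled dot-product attention without learned projections, a two-step exchange (bind prompts first read from all modalities, then each modality reads back from the shared bind prompts), a residual update, and random toy tokens. The names `cross_attention`, `random_tokens`, `bind`, and `semantic` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def random_tokens(rng, n, dim):
    """Toy stand-in for learned prompt tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8  # hypothetical embedding width

# Semantic prompt tokens for the three modalities named in the abstract.
semantic = {name: random_tokens(rng, 2, DIM) for name in ("RGB", "NIR", "TIR")}
bind = random_tokens(rng, 2, DIM)  # shared bind prompts

# Assumed step 1: bind prompts gather information from every modality.
all_tokens = [tok for toks in semantic.values() for tok in toks]
bind_updated = cross_attention(bind, all_tokens, all_tokens)

# Assumed step 2 (one-to-many): each modality reads from the shared bind
# prompts and adds the result to its own tokens as a residual.
updated = {}
for name, toks in semantic.items():
    attended = cross_attention(toks, bind_updated, bind_updated)
    updated[name] = [[t + a for t, a in zip(tok, att)]
                     for tok, att in zip(toks, attended)]

print({name: (len(toks), len(toks[0])) for name, toks in updated.items()})
```

Routing all exchange through a small set of shared tokens means no modality attends directly to another modality's tokens, which is one way to pass complementary information while leaving each modality's own prompts largely intact.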