MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Yuhao Wang, Xuehu Liu, Tianyu Yan, Yang Liu, Aihua Zheng, Pingping Zhang, Huchuan Lu

TL;DR

This work tackles robust multi-modal object ReID by using CLIP as the backbone and making the fusion of long multi-modal token sequences efficient. It introduces the Parallel Feed-Forward Adapter (PFA) for knowledge transfer, the Synergistic Residual Prompt (SRP) for cross-modal prompt synergy, and Mamba Aggregation (MA) for linear-complexity intra- and inter-modality modeling. On RGBNT201, RGBNT100, and MSVR310, MambaPro reports state-of-the-art results with fewer trainable parameters than fully fine-tuned CLIP and other Transformer-based methods, and its Grad-CAM visualizations make the learned features interpretable. The approach is relevant to surveillance and cross-modal recognition: it offers scalable fusion of heterogeneous modalities and may generalize to broader domains and long-sequence scenarios.

Abstract

Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address the above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. Specifically, we first employ a Parallel Feed-Forward Adapter (PFA) to adapt CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro can extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at https://github.com/924973292/MambaPro.
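The scalability argument above can be made concrete with a back-of-the-envelope cost model. The sketch below is not taken from the paper or its code: the function names, the FLOP constants, the state size of 16, and the token counts are all illustrative assumptions. It only contrasts the quadratic growth of the two N x N matrix products in self-attention with the linear growth of a diagonal state-space scan as the concatenated multi-modal sequence gets longer.

```python
def self_attention_flops(seq_len: int, dim: int) -> int:
    """Rough FLOPs for the two N x N matrix products in self-attention.

    Q @ K^T and (attention weights) @ V each take about N^2 * d
    multiply-adds, i.e. 2 * N^2 * d FLOPs. Projections and softmax
    are ignored (simplifying assumption).
    """
    return 4 * seq_len * seq_len * dim


def ssm_scan_flops(seq_len: int, dim: int, state_size: int = 16) -> int:
    """Rough FLOPs for a diagonal state-space scan (assumed cost model).

    Per token, channel, and state: about 3 FLOPs for the update
    h = a * h + b * x and 2 FLOPs for accumulating the readout c * h.
    Input/output projections are ignored; state_size is hypothetical.
    """
    return 5 * seq_len * dim * state_size


if __name__ == "__main__":
    dim = 512                  # hypothetical feature width
    tokens_per_modality = 128  # hypothetical patch-token count
    for num_modalities in (1, 2, 3):
        n = num_modalities * tokens_per_modality
        sa = self_attention_flops(n, dim)
        ssm = ssm_scan_flops(n, dim)
        print(f"N={n:4d}  attention={sa:,}  scan={ssm:,}  ratio={sa / ssm:.1f}")
```

Under this model, doubling the sequence length quadruples the attention cost but only doubles the scan cost, which is the behavior the linear-complexity claim refers to. Absolute numbers depend on details the model leaves out, so only the trend should be read from it.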

Paper Structure

This paper contains 28 sections, 23 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: (a) Comparison between previous methods and MambaPro. (b) FLOPs comparison between SSM and SA.
  • Figure 2: The overall framework of MambaPro. The Parallel Feed-Forward Adapter (PFA) is first introduced to transfer pre-trained knowledge into the ReID task. Then, the Synergistic Residual Prompt (SRP) is inserted to guide the progressive fusion of multi-modal features. Finally, the Mamba Aggregation (MA) is proposed to model interactions of long sequences from different modalities. With the proposed modules, our framework can obtain more robust features with low computational complexity.
  • Figure 3: Details of our proposed Mamba Aggregation.
  • Figure 4: Details of different prompt mechanisms.
  • Figure 5: Feature distributions with t-SNE (van der Maaten and Hinton, 2008). Different colors represent different IDs.
  • ...and 6 more figures