MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt
Yuhao Wang, Xuehu Liu, Tianyu Yan, Yang Liu, Aihua Zheng, Pingping Zhang, Huchuan Lu
TL;DR
This work tackles robust multi-modal object ReID by using CLIP as the backbone and by making the fusion of long token sequences from multiple modalities more efficient. It introduces the Parallel Feed-Forward Adapter (PFA) for knowledge transfer, the Synergistic Residual Prompt (SRP) for cross-modal prompt synergy, and Mamba Aggregation (MA) for linear-complexity intra- and inter-modality modeling. On RGBNT201, RGBNT100, and MSVR310, MambaPro reports state-of-the-art results with fewer trainable parameters than fully fine-tuned CLIP and other Transformer-based models, and it provides interpretable Grad-CAM visualizations. The approach is relevant to surveillance and cross-modal recognition, offers scalable fusion of heterogeneous modalities, and may generalize to broader domains and long-sequence scenarios.
Abstract
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address the above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) to adapt CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro can extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at https://github.com/924973292/MambaPro.
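To give a feel for why a state-space scan scales linearly with sequence length, the following toy sketch concatenates the token sequences of several modalities and processes them with a single recurrent pass. It is only an illustration under strong assumptions, not the Mamba Aggregation module itself: tokens are scalars, the parameters `a`, `b`, and `c` are fixed constants (Mamba makes them input-dependent), and the function names `linear_scan` and `aggregate_modalities` are hypothetical.

```python
from typing import List, Sequence


def linear_scan(tokens: Sequence[float], a: float = 0.9,
                b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t in a single pass.

    Each token is visited once, so the cost grows linearly with the
    sequence length (self-attention compares every pair of tokens).
    """
    h = 0.0  # hidden state carried across tokens
    outputs = []
    for x in tokens:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


def aggregate_modalities(modalities: Sequence[Sequence[float]],
                         a: float = 0.9, b: float = 1.0,
                         c: float = 1.0) -> List[float]:
    """Concatenate per-modality tokens and scan them as one long sequence.

    Assumption for illustration: modalities are simply joined end to end,
    so the state built from earlier modalities influences later ones.
    """
    joint = [x for seq in modalities for x in seq]
    return linear_scan(joint, a=a, b=b, c=c)


# Toy inputs standing in for RGB, near-infrared, and thermal-infrared tokens.
rgb, nir, tir = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
out = aggregate_modalities([rgb, nir, tir], a=0.5)
```

In this sketch the output at each thermal-infrared position is nonzero even though those inputs are zero, because the hidden state still carries information from the RGB and near-infrared tokens. That carry-over is the sense in which a single scan can model inter-modality interactions at linear cost.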
