IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu

TL;DR

This work tackles robustness in multi-modal object ReID by adding text-based semantic guidance, generated through a standardized captioning pipeline, to a new framework called IDEA. IDEA comprises the Inverted Multi-modal Feature Extractor (IMFE), which uses Modal Prefixes and an InverseNet to fuse text-guided semantics with visual features, and Cooperative Deformable Aggregation (CDA), which adaptively samples discriminative local features and aggregates them with global context via cross-attention. The authors construct three text-enhanced benchmarks and report state-of-the-art performance on RGBNT201 and MSVR310, with ablations that isolate the contributions of IMFE, CDA, and text guidance. The results illustrate how structured textual semantics can guide multi-modal feature learning and improve ReID robustness in challenging environments.

Abstract

Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. Specifically, we propose a standardized multi-modal caption generation pipeline that uses Multi-modal Large Language Models (MLLMs) to produce structured and concise text annotations. In addition, current methods often aggregate multi-modal information directly, without selecting representative local features, which leads to redundancy and high complexity. To address these issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
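To make the CDA description more concrete, the following is a minimal sketch of the general idea of deformable sampling followed by cross-attention. It is not the authors' implementation. The function names (`bilinear_sample`, `deformable_aggregate`), the residual update, and the scaled dot-product scoring are all assumptions chosen for illustration. In particular, the offsets are passed in by hand here, whereas the abstract states that CDA generates sampling positions adaptively.

```python
import math


def bilinear_sample(fmap, y, x):
    """Sample a feature vector at fractional (y, x) from an H x W x C grid.

    Coordinates outside the grid are clamped to the border.
    """
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    channels = len(fmap[0][0])
    return [
        (1 - dy) * (1 - dx) * fmap[y0][x0][k]
        + (1 - dy) * dx * fmap[y0][x1][k]
        + dy * (1 - dx) * fmap[y1][x0][k]
        + dy * dx * fmap[y1][x1][k]
        for k in range(channels)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(fmap, global_feat, ref_points, offsets):
    """Let a global query attend over local features sampled at ref + offset.

    Assumption: offsets are given as input. In a learned model they would be
    predicted by a small network rather than supplied by hand.
    """
    sampled = [
        bilinear_sample(fmap, ry + oy, rx + ox)
        for (ry, rx), (oy, ox) in zip(ref_points, offsets)
    ]
    scale = math.sqrt(len(global_feat))
    # Scaled dot-product scores between the global query and each sampled key.
    scores = [sum(q * k for q, k in zip(global_feat, s)) / scale for s in sampled]
    weights = softmax(scores)
    # Attention-weighted sum of the sampled local features.
    local = [
        sum(wt * s[k] for wt, s in zip(weights, sampled))
        for k in range(len(global_feat))
    ]
    # Residual update (an illustrative choice): global context plus local detail.
    fused = [g + l for g, l in zip(global_feat, local)]
    return fused, weights


# Toy 2 x 2 feature map with 2 channels, and a 2-dimensional global feature.
toy_fmap = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
fused, weights = deformable_aggregate(
    toy_fmap,
    global_feat=[1.0, 0.0],
    ref_points=[(0.0, 0.0), (0.0, 1.0)],
    offsets=[(0.0, 0.0), (0.0, 0.0)],
)
```

In this toy setup the global feature acts as the query, so sampled locations whose features align with it receive larger attention weights. Only a handful of sampled positions enter the attention step, rather than every location in the feature map.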

Paper Structure

This paper contains 28 sections, 12 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: Overall illustration of our motivations and proposed framework. (a) Our multi-modal caption generation pipeline. (b) Limitations of existing MLLM-generated captions. (c) Comparison between previous methods and our proposed IDEA.
  • Figure 2: The upper row presents the template used to annotate our dataset. The lower row provides detailed information about the annotated dataset. (a) Category statistics for our annotated person ReID dataset. (b) Example images and captions from the RGBNT201 dataset. (c) Category statistics for our annotated vehicle ReID dataset. (d) Example images and captions from the MSVR310 dataset.
  • Figure 3: Illustration of the proposed IDEA framework. The upper part depicts the Inverted Multi-modal Feature Extractor (IMFE). It employs modal prefixes and an InverseNet to incorporate semantic text guidance for feature discriminability. The lower part highlights the Cooperative Deformable Aggregation (CDA), which adaptively integrates discriminative local information with global features. With the integration of IMFE and CDA, IDEA effectively extracts discriminative multi-modal features for object ReID.
  • Figure 4: Comparison with different hyper-parameters.
  • Figure 5: Visualization of the cosine similarity distribution.
  • ...and 14 more figures