Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Pingping Zhang, Yuhao Wang, Yang Liu, Zhengzheng Tu, Huchuan Lu

TL;DR

This work tackles robust multi-modal object ReID by addressing background interference and modality gaps through EDITOR, a Transformer-based framework that learns object-centric token representations. It introduces Spatial-Frequency Token Selection (SFTS) to capture diverse, modality-aware tokens and Hierarchical Masked Aggregation (HMA) to fuse intra- and inter-modal features. Complementary losses, Background Consistency Constraint ($\mathcal{L}_{BCC}$) and Object-Centric Feature Refinement ($\mathcal{L}_{OCFR}$), stabilize token selection and tighten intra-ID clustering while separating different IDs. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 demonstrate competitive or superior performance against state-of-the-art methods, with clear gains from ablations validating the value of each component. The approach offers practical impact for robust cross-modal retrieval in real-world surveillance and automotive scenarios where backgrounds and modality gaps are prevalent.

Abstract

Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve the feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at https://github.com/924973292/EDITOR.
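To make the idea of selecting tokens with both spatial and frequency information more concrete, here is a minimal sketch in plain Python. It is not the authors' SFTS implementation: the abstract does not say how scores are computed or combined, so the per-token scores, the helper names `top_k_mask` and `select_tokens`, and the choice to take the union of the two top-k sets are all assumptions made for illustration.

```python
from typing import List, Sequence


def top_k_mask(scores: Sequence[float], k: int) -> List[bool]:
    """Return a boolean mask marking the k highest-scoring tokens."""
    k = max(0, min(k, len(scores)))
    # Rank token indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def select_tokens(
    spatial_scores: Sequence[float],
    frequency_scores: Sequence[float],
    k: int,
) -> List[bool]:
    """Keep a token if it is in the top-k under either criterion.

    Assumption: the two criteria are merged by a simple union. The paper's
    actual combination rule may differ.
    """
    if len(spatial_scores) != len(frequency_scores):
        raise ValueError("both score lists must cover the same tokens")
    spatial = top_k_mask(spatial_scores, k)
    frequency = top_k_mask(frequency_scores, k)
    return [s or f for s, f in zip(spatial, frequency)]


# Toy example: 6 patch tokens with hand-made scores (not taken from the paper).
spatial_scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3]
frequency_scores = [0.2, 0.7, 0.1, 0.1, 0.6, 0.3]
mask = select_tokens(spatial_scores, frequency_scores, k=2)
selected = [i for i, keep in enumerate(mask) if keep]
print(selected)
```

In this toy case the spatial criterion keeps tokens 0 and 2 and the frequency criterion keeps tokens 1 and 4, so the union retains four of the six tokens. The resulting mask is the kind of signal a later masked aggregation step could consume, although how HMA uses it is not specified in this section.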

Paper Structure (22 sections, 24 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Comparison of different methods and token selections. (a) Framework of previous methods; (b) Framework of our proposed EDITOR; (c) RGB images; (d) Spatial-based token selection; (e) Multi-modal frequency transform; (f) Frequency-based token selection; (g) Selected tokens in the NIR modality; (h) Selected tokens in the TIR modality.
  • Figure 2: An illustration of our proposed EDITOR. First, features from different input modalities are extracted by using the shared ViT-B/16 backbone. Then, a Spatial-Frequency Token Selection (SFTS) is utilized to select diverse tokens with object-centric features. Meanwhile, the Background Consistency Constraint (BCC) loss is designed for stabilizing the selection process. After that, a Hierarchical Masked Aggregation (HMA) is grafted to aggregate the selected tokens. Finally, combined with the Object-Centric Feature Refinement (OCFR) loss, the whole framework can obtain more discriminative features for multi-modal object ReID.
  • Figure 3: Illustration of spatial-based token selection.
  • Figure 4: Illustration of frequency-based token selection.
  • Figure 5: Alignment visualization in HMA with RGB modality.
  • ...and 12 more figures