Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification

Yangyang Liu, Yuhao Wang, Pingping Zhang

TL;DR

This work tackles background interference and cross-modal misalignment in multi-modal object ReID by introducing Signal, a framework built from three modules: SIM for selective intra- and inter-modal patch token interaction, GAM for anchor-free global alignment in gramian space via $\mathrm{Vol}(f_R', f_N', f_T') = \sqrt{\det G}$, and LAM for shift-aware local alignment through deformable sampling. The model is trained with a composite objective that combines cross-entropy, triplet, gramian contrastive, and MSE losses to enforce both global and local consistency across modalities. Experiments on RGBNT201, RGBNT100, and MSVR310 report state-of-the-art performance, and ablation studies support the contribution of each module. The resulting multi-modal representations are less sensitive to background clutter and more consistently aligned across modalities, and the source code is publicly available.

Abstract

Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by exploiting complementary multi-modal image information. Existing methods mainly concentrate on fusing multi-modal features, yet neglect background interference. In addition, current multi-modal fusion methods often align modalities in pairs and therefore struggle to achieve consistent alignment across all modalities. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens using intra-modal and inter-modal information. These important patch tokens interact with the class tokens, thereby yielding more discriminative features. Then, we propose a Global Alignment Module (GAM) to align multi-modal features simultaneously by minimizing the volume of 3D polyhedra in the gramian space. Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our proposed framework can extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.
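
To make the volume idea behind GAM concrete, the following is a minimal sketch of the quantity $\sqrt{\det G}$ for three feature vectors. It is not the authors' implementation. It assumes that $G$ is the 3x3 Gram matrix of pairwise inner products and that the features are L2-normalized first; the helper names (`normalize`, `gram_matrix`, `det3`, `gram_volume`) are hypothetical and chosen only for illustration. Under these assumptions the volume is 1 when the three features are mutually orthogonal and shrinks toward 0 as they align, which is why minimizing it pulls all three modalities together at once rather than pair by pair.

```python
import math


def normalize(v):
    """L2-normalize a feature vector (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_matrix(vectors):
    """Gram matrix G with G[i][j] = <vectors[i], vectors[j]>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def gram_volume(f_r, f_n, f_t):
    """Volume spanned by three modality features: sqrt(det G)."""
    unit = [normalize(v) for v in (f_r, f_n, f_t)]
    det = det3(gram_matrix(unit))
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(det, 0.0))


if __name__ == "__main__":
    # Mutually orthogonal features span the largest possible volume.
    print(gram_volume([1, 0, 0], [0, 1, 0], [0, 0, 1]))
    # Features pointing in the same direction span (almost) no volume.
    print(gram_volume([1, 2, 3], [2, 4, 6], [3, 6, 9]))
```

In a training setting this scalar would be computed on batched tensors and folded into a loss term; how the paper combines it with its gramian contrastive loss is not specified in this summary.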

Paper Structure

This paper contains 17 sections, 30 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Motivations of our framework. (a) Background interferences in multi-modal images. (b) Comparison between previous pairwise alignment and our simultaneous alignment. (c) Misalignments exist across different modalities.
  • Figure 2: Illustration of our proposed framework.
  • Figure 3: Details of Global Alignment Module.
  • Figure 4: Mechanism of Local Alignment Module.
  • Figure 5: (a) Performance with different numbers of reserved tokens. (b) Performance with different $\alpha$ and $\beta$.
  • ...and 4 more figures