DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification

Yuhao Wang, Yang Liu, Aihua Zheng, Pingping Zhang

TL;DR

DeMo tackles multi-modal object ReID under dynamic imaging quality by decoupling modality-specific and shared information and weighting decoupled features with an attention-guided mixture of experts. The approach combines Patch-Integrated Feature Extraction, hierarchical cross-modal decoupling, and attention-driven expert weighting to yield robust, adaptable representations across RGB, NIR, and TIR modalities. Empirical results on three benchmarks show strong performance and robustness to missing modalities, with comprehensive ablations and visualizations confirming each component's contribution. The work advances multi-modal ReID by integrating decoupled feature design with MoE and attention mechanisms, enabling reliable perception in challenging, modality-heterogeneous environments.

Abstract

Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks fully verify the effectiveness of our methods. The source code is available at https://github.com/924973292/DeMo.
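To make the last step of the abstract concrete, here is a minimal sketch of attention-derived expert weighting. It is an illustration only, not the authors' implementation (see the linked repository for that). It assumes each decoupled feature is a plain vector, that the attention query is the mean of all decoupled features, and that each expert is a small random linear map with ReLU. The names `attention_weights`, `make_expert`, and `attention_weighted_moe` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(features):
    """One weight per decoupled feature via scaled dot-product attention.

    Assumption: the query is the mean of all features. The abstract only
    says the weights are derived from the decoupled features.
    """
    dim = len(features[0])
    query = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    scores = [dot(query, f) / math.sqrt(dim) for f in features]
    return softmax(scores)


def make_expert(dim, rng):
    """Hypothetical expert: a random linear map followed by ReLU."""
    w = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [max(0.0, dot(row, x)) for row in w]

    return expert


def attention_weighted_moe(features, experts):
    """Scale each expert's output by its attention weight, then concatenate."""
    weights = attention_weights(features)
    fused = []
    for w, f, expert in zip(weights, features, experts):
        fused.extend(w * v for v in expert(f))
    return fused, weights


# Usage: three stand-in decoupled features of dimension 4.
rng = random.Random(0)
dim = 4
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
experts = [make_expert(dim, rng) for _ in features]
fused, weights = attention_weighted_moe(features, experts)
```

The point of the sketch is the contrast with a conventional gate: the weights are computed from the features themselves through attention scores rather than by a separate gating layer, so a feature that carries little signal (for example, an all-zero vector) receives a smaller weight than informative ones.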

Paper Structure

This paper contains 23 sections, 17 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: (a) The prevalent dynamic quality changes in multi-modal imaging. (b) Hierarchical feature decoupling. (c) The proposed modules and the framework of our DeMo.
  • Figure 2: The overall framework of our DeMo. We first employ a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity features from each modality. Then, the Hierarchical Decoupling Module (HDM) decouples multi-modal features into different levels with learnable query tokens. Finally, the Attention-Triggered Mixture of Experts (ATMoE) adaptively balances the decoupled features with accurate and context-aware weights, generating robust multi-modal features.
  • Figure 3: Detailed structure of ATMoE.
  • Figure 4: Feature distributions visualized with t-SNE (van der Maaten and Hinton, 2008). Different colors refer to different IDs.
  • Figure 5: Activation maps of decoupled features.
  • ...and 7 more figures
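
The Figure 2 caption mentions decoupling with learnable query tokens. A minimal sketch of that idea follows, under stated assumptions: each query token pools patch tokens through single-head cross-attention, and the hierarchy has one level per non-empty subset of modalities (modality-specific for single modalities, shared for larger subsets). Neither detail is confirmed by the text above, the queries here are fixed stand-ins rather than trained parameters, and `query_pool` and `decouple` are hypothetical names.

```python
import math
from itertools import combinations


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_pool(query, tokens):
    """Single-head cross-attention: one query token attends over patch tokens."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens
    ]
    attn = softmax(scores)
    return [sum(a * tok[i] for a, tok in zip(attn, tokens)) for i in range(dim)]


def decouple(tokens_by_modality, queries):
    """Return one pooled feature per non-empty subset of modalities.

    Assumption: a subset's query attends over the concatenated patch
    tokens of every modality in that subset.
    """
    names = sorted(tokens_by_modality)
    decoupled = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            tokens = [t for name in subset for t in tokens_by_modality[name]]
            decoupled[subset] = query_pool(queries[subset], tokens)
    return decoupled


# Usage: toy 2-dimensional patch tokens for three modalities.
tokens_by_modality = {
    "RGB": [[1.0, 0.0], [0.0, 1.0]],
    "NIR": [[1.0, 1.0]],
    "TIR": [[0.0, 0.0]],
}
names = sorted(tokens_by_modality)
# Zero queries give uniform attention, i.e. plain averaging of the tokens.
queries = {
    subset: [0.0, 0.0]
    for k in range(1, len(names) + 1)
    for subset in combinations(names, k)
}
decoupled = decouple(tokens_by_modality, queries)
```

With three modalities this yields seven pooled features, one per subset. In a trained model the queries would be learned, so that each one extracts different information from its token set.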