ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model

Jialong Zuo, Yongtai Deng, Mengdan Tan, Rui Jin, Dongyue Wu, Nong Sang, Liang Pan, Changxin Gao

TL;DR

Omni Multi-modal Person Re-identification (OM-ReID) is the task of retrieving a person using a query from any single modality or any combination of modalities. The authors introduce ORBench, a high-quality five-modality dataset (RGB, infrared, color pencil, sketch, and text), and ReID5o, a unified framework with a multi-modal tokenizing assembler, a multi-expert router with modality-specific adapters, and a feature mixture that enables cross-modal alignment via the SDM loss (from IRRA) and identity losses. Extensive experiments show that multi-modal queries significantly improve retrieval performance and that ReID5o achieves state-of-the-art results across all modality combinations, validating both the dataset quality and the effectiveness of the method. This work establishes a foundation for OM-ReID research and provides a public dataset and code to support further work on multi-modal person ReID.

Abstract

In real-world scenarios, person re-identification (ReID) is expected to identify a person of interest from a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to a limited set of modalities and fail to meet this requirement. We therefore investigate a new, challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset is also notably diverse, for example in its painting perspectives and textual information, and can serve as a platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model through a proposed unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of ORBench. A wide range of models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.

Paper Structure

This paper contains 28 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Our ReID5o can effectively conduct retrieval with any combination of the five modalities, adapting to queries whose available modalities vary in real scenarios. However, existing methods [SYSUMM01, cuhkpedes, zhai2022trireid, chen2023modalityagnostic, AIO] are constrained to a few modalities and cannot achieve arbitrary retrieval with five modalities.
  • Figure 2: The overview of our proposed ORBench dataset. Our dataset is remarkable for containing rich, high-quality and diverse five-modal descriptive data for the same person, offering a comprehensive and in-depth resource for person ReID research.
  • Figure 3: The schematic of our proposed ReID5o framework. As the unified multi-modal encoder extracts the modality-shared features, our specially designed multi-expert router can effectively promote the modality-specific representation learning.
  • Figure 4: Comparisons of the Shannon entropy per textual description with existing person ReID datasets containing the text modality. The average entropy of our dataset reaches 5.53, representing the highest level of textual information richness among current datasets.
  • Figure 5: Public evaluation to assess the identity consistency and perspective conformity of the color pencil drawings in ORBench.
  • ...and 2 more figures
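
The per-description Shannon entropy mentioned in the Figure 4 caption can be illustrated with a minimal sketch. The source does not state the token unit or logarithm base behind the reported average of 5.53, so this sketch assumes whitespace-separated, lowercased word tokens and base-2 logarithms; the function names `shannon_entropy` and `average_entropy` are hypothetical and are not taken from the authors' code.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word distribution in one description.

    Assumption: tokens are lowercased, whitespace-separated words. The paper's
    exact tokenization and log base are not given in this summary.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum_w p(w) * log2 p(w), with p(w) the relative frequency of word w.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def average_entropy(descriptions: list[str]) -> float:
    """Mean per-description entropy over a collection of descriptions."""
    if not descriptions:
        return 0.0
    return sum(shannon_entropy(d) for d in descriptions) / len(descriptions)


if __name__ == "__main__":
    # Illustrative captions only; these are not drawn from ORBench.
    captions = [
        "a man in a black jacket and blue jeans carrying a red backpack",
        "a woman wearing a white dress",
    ]
    for caption in captions:
        print(f"{shannon_entropy(caption):.2f} bits: {caption}")
    print(f"average: {average_entropy(captions):.2f} bits")
```

Under this word-level definition, a description scores higher when it is longer and uses more distinct words, which is one way to read "textual information richness" in the caption.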