All in One Framework for Multimodal Re-identification in the Wild

He Li; Mang Ye; Ming Zhang; Bo Du

TL;DR

The paper tackles re-identification (ReID) when the input modalities are uncertain or unseen at deployment. It introduces All-in-One (AIO), a framework that unifies RGB, infrared (IR), sketch, and text inputs through a multimodal tokenizer and a frozen foundation model used as a shared encoder. Three cross-modal heads guide training: Conventional Classification, Vision Guided Masked Attribute Modeling, and Multimodal Feature Binding. Missing modalities are synthesized with Channel Augmentation (CA) and Lineart, and training follows a progressive strategy. The stated contributions are the first all-in-one ReID architecture covering four modalities, a modality-agnostic learning paradigm, and extensive zero-shot evaluations in which AIO is competitive with or superior to existing baselines on cross-modal and multimodal tasks. The approach targets zero-shot generalization in wild, uncertain environments, with practical relevance to surveillance and other multimodal retrieval applications. The authors acknowledge its computational complexity and suggest efficiency improvements as future work.

Abstract

In Re-identification (ReID), recent advancements have yielded noteworthy progress in both unimodal and cross-modal retrieval tasks. However, developing a unified framework that can effectively handle varying multimodal data, including RGB, infrared, sketch, and text, remains a challenge. Additionally, large-scale models show promising performance in various vision tasks, but a foundation model for ReID is still missing. In response to these challenges, a novel multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO), which harnesses a frozen pre-trained large model as an encoder, enabling effective multimodal retrieval without additional fine-tuning. The diverse multimodal data in AIO are tokenized into a unified space, allowing the modality-shared frozen encoder to extract identity-consistent features across all modalities. Furthermore, an ensemble of cross-modality heads is designed to guide the learning trajectory. AIO is the first framework to perform all-in-one ReID, encompassing four commonly used modalities. Experiments on cross-modal and multimodal ReID show that AIO not only handles various modal data but also performs well in challenging contexts, particularly in zero-shot and domain generalization scenarios.
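The data flow described in the abstract (per-modality tokenization into a shared space, followed by one frozen modality-shared encoder) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the class name `ToyAIO`, the width `EMBED_DIM`, the input sizes, and the use of random linear maps with a `tanh` as a stand-in for the pretrained foundation model. It is not the authors' implementation, and the cross-modal heads are omitted.

```python
import math
import random

EMBED_DIM = 8  # hypothetical width of the shared token space


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyAIO:
    """Toy data flow only: one tokenizer per modality, one shared frozen encoder."""

    def __init__(self, input_dims, seed=0):
        rng = random.Random(seed)
        # One projection per modality; in a real system these would be trained.
        self.tokenizers = {m: make_matrix(EMBED_DIM, d, rng) for m, d in input_dims.items()}
        # Stand-in for the frozen foundation model: fixed weights, never updated.
        self.frozen_encoder = make_matrix(EMBED_DIM, EMBED_DIM, rng)

    def tokenize(self, modality, units):
        # Each input unit (e.g. a patch or word vector) becomes one shared-space token.
        return [matvec(self.tokenizers[modality], u) for u in units]

    def embed(self, inputs):
        # Accept any subset of modalities and concatenate their tokens.
        tokens = []
        for modality, units in inputs.items():
            tokens.extend(self.tokenize(modality, units))
        encoded = [[math.tanh(v) for v in matvec(self.frozen_encoder, t)] for t in tokens]
        # Mean-pool the tokens, then L2-normalize for cosine-similarity retrieval.
        pooled = [sum(col) / len(encoded) for col in zip(*encoded)]
        norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
        return [v / norm for v in pooled]


# Usage: a query that happens to arrive as sketch + text only.
model = ToyAIO({"rgb": 12, "ir": 4, "sketch": 4, "text": 6})
data_rng = random.Random(1)
query = {
    "sketch": [[data_rng.random() for _ in range(4)] for _ in range(3)],
    "text": [[data_rng.random() for _ in range(6)] for _ in range(2)],
}
feature = model.embed(query)
```

Because every modality lands in the same token space, the same `embed` call serves any combination of inputs, which is the property the abstract emphasizes for uncertain deployment conditions.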

Paper Structure (15 sections, 7 equations, 3 figures, 9 tables)


Introduction
Related Work
Cross-modal ReID
Multimodal Learning
Foundation Model
All in One Framework
Multimodal Tokenizer
Missing Modality Synthesis
Multimodal Modeling and Binding
Overall Architecture
Experiment
Experimental Settings
Ablation Study
Evaluation on Multimodal ReID
Conclusion

Figures (3)

Figure 1: Illustration of the proposed AIO and existing methods. (a) Existing ReID methods [ye2020dynamic, zhu2021dssl, chen2023modalityagnostic] learn cross-modal ReID models independently and cannot handle the uncertain input modalities found in real-world scenarios. (b) Our proposed AIO framework handles diverse combinations of input modalities, addressing the uncertainty inherent in practical deployment.
Figure 2: The schematic of the proposed AIO framework. VA: Vision Guided Masked Attribute Modeling head, FB: Feature Binding head, CE: Classification head. Our framework mainly contains three parts: I) a learnable multimodal tokenizer to project diverse modalities into a unified embedding, II) a frozen foundation model to extract complementary cross-modal representations, and III) several cross-modal heads used to mine cross-modality relationships. To alleviate the missing modality problem, we also leverage Channel Augmentation [ye2024channel] and Lineart [von2022diffusers] to synthesize the IR and sketch images that are missing.
Figure 3: The generated synthetic Sketch and IR images. We also visualize the feature distribution of RGB, IR, Sketch, and synthesized images.
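The Figure 2 caption mentions Channel Augmentation for synthesizing missing IR images. As a rough illustration, one common form of channel augmentation replaces all three RGB channels with a single randomly chosen channel, producing a single-channel-looking image. Whether the paper uses exactly this variant is not stated in this section, so the sketch below is an assumption; the function name `channel_exchange` and the nested-list image format are hypothetical.

```python
import random


def channel_exchange(image, rng=None):
    """Copy one randomly chosen RGB channel into all three channels.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns the augmented image and the index of the chosen channel.
    Assumed variant of channel augmentation, for illustration only.
    """
    if rng is None:
        rng = random.Random()
    chosen = rng.randrange(3)  # 0 = R, 1 = G, 2 = B
    augmented = [[(px[chosen], px[chosen], px[chosen]) for px in row] for row in image]
    return augmented, chosen


# Usage on a tiny 2x2 image.
rgb_image = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (1, 2, 3)],
]
pseudo_ir, chosen_channel = channel_exchange(rgb_image, random.Random(0))
```

The output keeps the spatial layout of the RGB input while discarding color differences between channels, which is why it can act as a cheap proxy when a real IR image is unavailable.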
