ReText: Text Boosts Generalization in Image-Based Person Re-identification
Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin
TL;DR
ReText tackles the generalization gap in image-based person Re-ID by training on a mixture of multi-camera data and captioned single-camera data, jointly optimizing three tasks: Re-ID on multi-camera data, and image-text matching and text-guided image reconstruction on the captioned single-camera data. It introduces two dedicated losses for cross-modal alignment, the Identity-aware Matching Loss and the Structure-preserving Loss, along with a reconstruction loss, and uses a ViT-based image encoder with a momentum encoder and a BERT text encoder. Experiments across standard cross-domain protocols show that combining textual supervision with stylistically diverse single-camera data yields consistent and substantial improvements over state-of-the-art methods. The findings highlight the value of language-informed semantic cues and cross-modal learning for robust, domain-invariant Re-ID in unseen environments.
Abstract
Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While many existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved with stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity because it contains minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
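The joint training described above can be pictured with a minimal sketch. It is not the authors' implementation: the names `Sample`, `route_tasks`, and `joint_loss` are hypothetical, the weighted sum is an assumed way of combining the three objectives, and the routing assumes that both text-based tasks run on the captioned single-camera samples. The individual loss functions are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Sample:
    """One training image; `caption` is set only for single-camera data."""
    image_id: str
    multi_camera: bool
    caption: Optional[str] = None


def route_tasks(sample: Sample) -> Tuple[str, ...]:
    """Return the names of the tasks a sample contributes to (assumed routing)."""
    if sample.multi_camera:
        # Multi-camera data carries identity labels across views: Re-ID task.
        return ("reid",)
    # Captioned single-camera data feeds the two text-based tasks.
    return ("matching", "reconstruction")


def joint_loss(task_losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-task losses as a weighted sum (weights are an assumption)."""
    return sum(weights[name] * value for name, value in task_losses.items())


# A mixed batch: one multi-camera sample and one captioned single-camera sample.
batch = [
    Sample("cam3_0001", multi_camera=True),
    Sample("web_0042", multi_camera=False, caption="a man in a red jacket"),
]
tasks_per_sample = [route_tasks(s) for s in batch]

# Placeholder loss values standing in for the real Re-ID, matching,
# and reconstruction objectives.
total = joint_loss(
    {"reid": 1.0, "matching": 2.0, "reconstruction": 4.0},
    {"reid": 1.0, "matching": 0.5, "reconstruction": 0.25},
)
```

In this sketch the two data sources share one batch but feed different objectives, which is the sense in which the abstract speaks of training on a mixture of the two.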
