ReText: Text Boosts Generalization in Image-Based Person Re-identification

Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin

TL;DR

ReText tackles the generalization gap in image-based person Re-ID by training on a mixture of multi-camera data and captioned single-camera data. It optimizes three tasks: Re-ID on multi-camera data, image-text matching on captioned single-camera data, and text-guided image reconstruction. It introduces two dedicated losses for cross-modal alignment, the Identity-aware Matching Loss and the Structure-preserving Loss, along with a reconstruction loss, and uses a ViT-based image encoder with a momentum encoder and a BERT text encoder. Experiments across standard cross-domain protocols show that combining textual supervision with stylistically diverse single-camera data yields consistent and substantial improvements over state-of-the-art methods. The findings highlight the value of language-informed semantic cues and cross-modal learning for robust, domain-invariant Re-ID in unseen environments.

Abstract

Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While many existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by training on stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
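
The joint optimization of the three tasks can be pictured as a weighted sum of per-task losses. The sketch below is only an illustration under stated assumptions: the names `LossWeights` and `joint_loss` are hypothetical, and the weights are placeholders, since the text above does not say how the three terms are balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the abstract does not state how the terms are balanced.
    reid: float = 1.0
    matching: float = 1.0
    reconstruction: float = 1.0


def joint_loss(reid_loss, matching_loss, reconstruction_loss, weights=LossWeights()):
    """Combine the three task losses into one scalar training objective.

    reid_loss:           computed on a multi-camera batch
    matching_loss:       image-text matching on captioned single-camera data
    reconstruction_loss: text-guided reconstruction on single-camera data
    """
    return (
        weights.reid * reid_loss
        + weights.matching * matching_loss
        + weights.reconstruction * reconstruction_loss
    )
```

With all weights equal to one, the objective is the plain sum of the three terms; setting a weight to zero switches the corresponding task off, which is one way to mimic an ablation such as training without text.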

Paper Structure (36 sections, 12 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Impact of single-camera data and textual supervision on cross-domain Re-ID performance (mAP). Adding single-camera data improves generalization (Ours w/o text), and further gains are achieved by incorporating textual descriptions (Ours with text), highlighting the effectiveness of multimodal supervision during training.
  • Figure 2: Overview of ReText. The model receives four types of inputs: multi-camera data, single-camera data, masked single-camera data, and textual captions. It jointly optimizes three loss functions: Re-ID loss on multi-camera data, image-text matching loss on single-camera data, and text-guided reconstruction loss on masked single-camera data.
  • Figure 3: Examples of image reconstruction in ReText. Each example shows the original image (left), masked input (middle), and the reconstructed output (right), alongside the corresponding caption. The model successfully recovers key visual details, highlighting the effectiveness of image-text training.
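
The masked input mentioned in Figures 2 and 3 can be illustrated with a minimal patch-masking sketch. Everything specific here is an assumption: the captions do not state the masking strategy, patch size, or masking ratio, so `mask_patches`, `patch_size`, and `mask_ratio` are hypothetical names for a generic random-patch scheme and not the authors' procedure.

```python
import random


def mask_patches(image, patch_size, mask_ratio, seed=None):
    """Zero out a random subset of non-overlapping square patches.

    image: 2D list (height x width) of pixel values; both dimensions must be
           divisible by patch_size.
    Returns (masked_image, masked_patches), where masked_patches is a set of
    (patch_row, patch_col) grid coordinates. The input image is not modified.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    rows, cols = height // patch_size, width // patch_size
    grid = [(r, c) for r in range(rows) for c in range(cols)]

    # Assumed scheme: choose a fixed fraction of patches uniformly at random.
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(grid)))
    masked_patches = set(rng.sample(grid, num_masked))

    masked_image = [row[:] for row in image]  # copy each row before editing
    for r, c in masked_patches:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked_image[y][x] = 0
    return masked_image, masked_patches
```

In a setup like the one the captions describe, the masked image and its caption would be fed to the model, and the reconstruction would be compared against the original image.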