Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao

TL;DR

This work tackles transferable text-to-image re-identification by generating a large-scale, MLLM-captioned dataset (LUPerson-MLLM) and training a CLIP-based model to transfer across datasets without target-domain labels. It introduces Template-based Diversity Enhancement (TDE) to diversify captions and Noise-aware Masking (NAM) to mitigate noise in MLLM-generated descriptions, all optimized with a bidirectional distribution-matching loss $\text{L}_{sdm} = \text{L}_{i2t} + \text{L}_{t2i}$. Empirical results show strong direct transfer and superior fine-tuning performance, with ablations confirming the effectiveness of both TDE and NAM and the benefits of large-scale pretraining. The approach offers practical impact for scalable cross-domain ReID, though it acknowledges limits in template coverage and occasional masking failures, suggesting avenues for future refinement.
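
The bidirectional loss mentioned above can be illustrated with a minimal sketch. It assumes the commonly used similarity distribution matching formulation, in which each direction is a KL divergence between softmax-normalized cosine similarities and a normalized identity-match distribution; the summary itself does not spell this out. The function names, the temperature `tau`, and the constant `eps` are illustrative assumptions, and plain Python lists stand in for feature tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def directional_loss(queries, gallery, labels, tau=0.02, eps=1e-8):
    """One direction of the loss: mean KL(p_i || q_i) over all queries.

    p_i: softmax of scaled cosine similarities between query i and the gallery.
    q_i: identity-match indicator for query i, normalized to sum to 1.
    tau and eps are assumed placeholder values.
    """
    n = len(queries)
    loss = 0.0
    for i in range(n):
        p = softmax([cosine(queries[i], g) / tau for g in gallery])
        matches = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        total = sum(matches)
        q = [m / total for m in matches]
        loss += sum(
            pj * math.log(pj / (qj + eps)) for pj, qj in zip(p, q) if pj > 0.0
        )
    return loss / n


def sdm_loss(image_feats, text_feats, labels, tau=0.02):
    """L_sdm = L_i2t + L_t2i for one batch of paired global features."""
    l_i2t = directional_loss(image_feats, text_feats, labels, tau)
    l_t2i = directional_loss(text_feats, image_feats, labels, tau)
    return l_i2t + l_t2i
```

In this sketch, a batch whose image and text features are correctly paired yields a loss near zero, while swapping the captions of two different identities makes both directional terms large.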

Abstract

Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
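
The template-based captioning step described in the abstract could be sketched as follows. Everything concrete here is hypothetical: the template strings, the slot names in brackets, and the prompt wording are invented for illustration and are not taken from the paper, and the call to an actual MLLM is deliberately left out.

```python
import random

# Hypothetical templates of the kind an LLM dialogue might produce.
# Bracketed slots are meant to be filled in by the captioning MLLM.
TEMPLATES = [
    "The [gender] is wearing a [color] [upper clothing] and [color] [lower clothing].",
    "With [hair], a [gender] in a [color] [upper clothing] carries a [belongings].",
    "Dressed in [color] [lower clothing] and [footwear], the [gender] has [hair].",
]


def build_caption_prompt(template):
    """Wrap one template in an instruction for the captioning model (assumed wording)."""
    return (
        "Describe the person in the image by filling in the bracketed slots "
        "of this template while keeping its sentence structure: " + template
    )


def sample_prompts(num_images, rng):
    """Draw one template per image so caption structures vary across the dataset."""
    return [build_caption_prompt(rng.choice(TEMPLATES)) for _ in range(num_images)]
```

Sampling a template independently for each image is one simple way to keep any single sentence pattern from dominating the generated captions.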

Paper Structure (12 sections, 7 equations, 5 figures, 5 tables)

This paper contains 12 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of textual descriptions generated by an MLLM (i.e., Qwen [bai2023qwen]). (Top) The description patterns are similar for different images. (Bottom) Our proposed Template-based Diversity Enhancement (TDE) method significantly enhances the description pattern diversity. Note that the generated descriptions shown in this figure contain some errors.
  • Figure 2: Overview of our framework. We adopt the CLIP-ViT/B-16 model as the backbone. Our framework uses one pedestrian image, the original textual description $T^{full}$, and a masked textual description $T^{nam}$ as input during training. $T^{nam}$ is obtained by applying NAM to $T^{full}$. To perform NAM, we first compute the similarity matrix $\mathbf{S}$ between the text tokens $\mathbf{F}_t$ of $T^{full}$ and the image tokens $\mathbf{F}_v$ according to their embeddings at the $l$-th layer of the encoders. Then, we estimate the probability that each text token is noisy according to the similarity between its embedding and the image token embeddings. The similarity distribution matching (SDM) loss is computed between the global visual feature $\bm{v}_{cls}$ of the pedestrian image and the global textual feature $\bm{t}'_{eos}$ of $T^{nam}$. Masking noisy words in $T^{full}$ improves the optimization of the model. (Best viewed in color.)
  • Figure 3: Results when NAM computes $\mathbf{S}$ from different encoder layers. The encoders contain 12 layers in total. Best viewed with zoom-in.
  • Figure 4: Results of different overall masking ratios $p$ for NAM. 'EM' represents masking all text tokens with the same probability $p$. Best viewed with zoom-in.
  • Figure 5: Training data size's impact on our methods' direct transfer ReID performance. '0 M' refers to directly using the original CLIP encoders.
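
The NAM procedure outlined in the Figure 2 caption, together with the equal-masking baseline ('EM') from Figure 4, can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: taking the maximum over image tokens as a word's visual support, using `1 - support` as its noisiness, rescaling so the average masking probability equals the overall ratio `p`, the default `p=0.15`, and the `[MASK]` token string are all choices made here for the example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(text_tokens, image_tokens):
    """S[i][j]: similarity between text token i and image token j."""
    return [[cosine(t, v) for v in image_tokens] for t in text_tokens]


def em_mask_probs(num_tokens, p=0.15):
    """Equal masking (EM): every text token is masked with probability p."""
    return [p] * num_tokens


def nam_mask_probs(S, p=0.15):
    """Per-token masking probabilities that grow as visual support shrinks.

    Assumption: a token's support is its best match among all image tokens,
    and probabilities are rescaled so their mean is p (before clipping at 1).
    """
    support = [max(row) for row in S]
    noisiness = [max(0.0, 1.0 - s) for s in support]
    mean_noise = sum(noisiness) / len(noisiness)
    if mean_noise == 0.0:
        return em_mask_probs(len(S), p)  # no token looks noisy; fall back to EM
    return [min(1.0, p * n / mean_noise) for n in noisiness]


def apply_mask(words, probs, rng, mask_token="[MASK]"):
    """Sample a masked description; probs would come from the previous epoch."""
    return [mask_token if rng.random() < pr else w for w, pr in zip(words, probs)]
```

With this rescaling, NAM and EM mask the same expected fraction of tokens, so any difference between them comes from which words are masked rather than how many.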