Data Augmentation for Text-based Person Retrieval Using Large Language Models

Zheng Li, Lijia Si, Caili Guo, Yang Yang, Qiushi Cao

TL;DR

This work tackles data scarcity in text-based person retrieval (TPR) by introducing LLM-DA, a data augmentation framework driven by large language models. It uses an LLM to rewrite existing text queries into augmented captions, applies a Text Faithfulness Filter (TFF) to discard semantically inconsistent rewrites, and then uses a Balanced Sampling Strategy (BSS) to mix original and augmented captions during training. The method is model-agnostic and plugs into CLIP-based TPR models, giving consistent improvements on three benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid), with larger gains observed for stronger backbones. Key contributions are (1) the first use of LLMs for TPR text augmentation, (2) a TFF to mitigate hallucinations, and (3) a BSS to balance the contributions of original and augmented data, which together improve cross-modal retrieval performance and data diversity.

Abstract

Text-based Person Retrieval (TPR) aims to retrieve person images that match a given text query. Improving TPR models relies on high-quality data for supervised training. However, it is difficult to construct a large-scale, high-quality TPR dataset because annotation is expensive and privacy must be protected. Recently, Large Language Models (LLMs) have approached or even surpassed human performance on many NLP tasks, making it possible to expand high-quality TPR datasets. This paper proposes an LLM-based Data Augmentation (LLM-DA) method for TPR. LLM-DA uses LLMs to rewrite the text in the current TPR dataset, expanding the dataset with high-quality text in a simple and efficient way. The rewritten texts increase the diversity of vocabulary and sentence structure while retaining the original key concepts and semantic information. To alleviate LLM hallucinations, LLM-DA introduces a Text Faithfulness Filter (TFF) that filters out unfaithful rewritten text. To balance the contributions of original and augmented text, a Balanced Sampling Strategy (BSS) controls the proportion of each used for training. LLM-DA is a plug-and-play method that can be easily integrated into various TPR models. Comprehensive experiments on three TPR benchmarks show that LLM-DA can improve the retrieval performance of current TPR models.
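
The filter-then-sample idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the threshold value, and the sampling probability are hypothetical, and the token-overlap (Jaccard) score is only a toy stand-in for the similarity $s(T_{i}^{ori}, T_{i}^{aug})$, whose exact definition is not given in this summary. The LLM rewriting step itself is omitted; the rewrites are assumed to be already available as strings.

```python
import random
from typing import Callable, Optional


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for s(T_ori, T_aug): token-set overlap in [0, 1].

    Assumption: a real system would more likely compare text embeddings.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def faithfulness_filter(
    original: str,
    rewrite: str,
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.3,  # hypothetical value, chosen for this example only
) -> Optional[str]:
    """TFF-style check: keep the rewrite only if it stays close to the original."""
    return rewrite if similarity(original, rewrite) >= threshold else None


def balanced_sample(
    original: str,
    augmented: Optional[str],
    p_original: float = 0.5,  # hypothetical proportion of original text
    rng: Optional[random.Random] = None,
) -> str:
    """BSS-style choice: pick the original with probability p_original.

    Falls back to the original when the rewrite was filtered out (None).
    """
    if rng is None:
        rng = random.Random()
    if augmented is None or rng.random() < p_original:
        return original
    return augmented


original = "A woman wearing a red coat and black jeans carries a white bag."
faithful = "A woman in a red coat and black jeans is carrying a white bag."
unfaithful = "A man rides a bicycle past a tall building."

kept = faithfulness_filter(original, faithful)       # similar enough: kept
dropped = faithfulness_filter(original, unfaithful)  # too different: None

rng = random.Random(0)
caption_for_this_step = balanced_sample(original, kept, p_original=0.5, rng=rng)
```

Because the sampling happens per training step, the same image can be paired with its original caption in one epoch and a faithful rewrite in another, while a filtered rewrite never reaches the model.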

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Original person image, original text, and augmented text.
  • Figure 2: The framework for using LLM-based Data Augmentation (LLM-DA) in TPR model training.
  • Figure 3: Using LLM for text augmentation.
  • Figure 4: Distribution of $s(T_{i}^{ori}, T_{i}^{aug})$ on the CUHK-PEDES dataset.
  • Figure 5: Text Faithfulness Filter (TFF).
  • ...and 3 more figures