An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

Min Cao, ZiYin Zeng, YuXin Lu, Mang Ye, Dong Yi, Jinqiao Wang

TL;DR

The paper addresses privacy and annotation challenges in Text-Based Person Retrieval (TBPR) by empirically validating synthetic data as a data-centric alternative. It introduces two pipelines: inter-class image generation with automatic prompts, and intra-class image augmentation via diffusion-model editing. Both are coupled with automatic text generation from Multimodal Large Language Models. The synthetic data is evaluated in no-data, limited-data, and abundant-data scenarios using a lightweight TBPR baseline with $L_{1}$ and $L_{2}$ losses and noise-robust strategies. Across three benchmarks, synthetic data yields consistent gains; ablations identify prompt quality and background-focused intra-class edits as key drivers, and show that variable instruction text provides richer descriptions that boost retrieval performance. The study also explores denoising and cross-domain/rare-environment applications, demonstrating practical viability, and offers code and a synthetic dataset to facilitate TBPR research in privacy-sensitive settings.

Abstract

Data plays a pivotal role in Text-Based Person Retrieval (TBPR) research. The mainstream research paradigm requires real-world person images with manual textual annotations for training models, which raises privacy concerns and is labor-intensive. Several pioneering efforts explore synthetic data for TBPR but still rely on real data, so these issues persist; the resulting synthetic datasets also lack diversity, which hurts TBPR performance. Moreover, these works explore synthetic data for TBPR from limited perspectives, which restricts the scope of exploration. In this paper, we conduct an empirical study to explore the potential of synthetic data for TBPR, highlighting three key aspects. (1) We propose an inter-class image generation pipeline, in which an automatic prompt construction strategy is introduced to guide generative Artificial Intelligence (AI) models in generating various inter-class images without reliance on original data. (2) We develop an intra-class image augmentation pipeline, in which generative AI models are applied to further edit the images to obtain various intra-class images. (3) Building upon the proposed pipelines and an automatic text generation pipeline, we explore the effectiveness of synthetic data in diverse scenarios through extensive experiments. Additionally, we experimentally investigate various noise-robust learning strategies to mitigate the inherent noise in synthetic data. We will release the code, along with the large-scale synthetic dataset generated by our pipelines, which are expected to advance practical TBPR research.
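To make the idea of automatic prompt construction more concrete, the following is a minimal sketch of one way such a strategy could work: descriptor lists fill a rough description template, and each distinct combination yields a prompt for one synthetic identity. The descriptor values, the template wording, and the names `DESCRIPTORS`, `TEMPLATE`, and `build_prompts` are all assumptions made for illustration; they are not taken from the paper, and no image-generation model is called here.

```python
import itertools
import random

# Hypothetical descriptor lists (illustrative only; not the paper's lists).
DESCRIPTORS = {
    "gender": ["man", "woman"],
    "upper": ["a red jacket", "a white t-shirt", "a blue hoodie"],
    "lower": ["black jeans", "grey shorts", "a long skirt"],
    "scene": ["on a street", "in a shopping mall", "at a train station"],
}

# Hypothetical rough description template with one slot per descriptor list.
TEMPLATE = "A full-body photo of a {gender} wearing {upper} and {lower}, walking {scene}."


def build_prompts(n, seed=0):
    """Return n distinct prompts, one per synthetic identity."""
    keys = list(DESCRIPTORS)
    # Enumerate every descriptor combination so sampled prompts never repeat.
    combos = list(itertools.product(*(DESCRIPTORS[k] for k in keys)))
    if n > len(combos):
        raise ValueError(f"only {len(combos)} distinct prompts are available")
    rng = random.Random(seed)  # seeded for reproducible sampling
    picked = rng.sample(combos, n)
    return [TEMPLATE.format(**dict(zip(keys, combo))) for combo in picked]


if __name__ == "__main__":
    for prompt in build_prompts(3):
        print(prompt)
```

In a full pipeline, each prompt would be passed to a text-to-image model to produce one identity. Sampling without replacement from the enumerated combinations is a simple way to keep identities distinct, although the number of identities is then bounded by the size of the descriptor lists.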

Paper Structure

This paper contains 28 sections, 5 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Data production paradigms for TBPR model training. (a) Data from a repository of real-world person images accompanied by manual textual annotations. (b) Data produced by generative AI models with the assistance of real-world person images or manual textual annotations. (c) Our proposed data production paradigm centered around three representative scenarios.
  • Figure 2: Workflow of our framework for validating synthetic data for TBPR. It involves the following steps. (1) Inter-class image generation: producing diverse person images with different identities; (2) Intra-class image augmentation: further generating multiple images of the same identities; (3) Text generation: extracting textual descriptions of the person images; (4) Baseline model: training the model using synthetic data alongside original real data (if accessible). The framework is performed across three representative scenarios.
  • Figure 3: Performance trend under different values of the guidance scale on CUHK-PEDES.
  • Figure 4: Illustration of real data (a) and synthetic data (b)$\sim$(c).
  • Figure 5: Descriptor lists used in the rough description templates for inter-class image generation.
  • ...and 8 more figures