Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval

Shuyu Yang, Yaxiong Wang, Yongrui Li, Li Zhu, Zhedong Zheng

Abstract

In this work, we focus on text-based person retrieval, which identifies individuals based on textual descriptions. Despite advancements enabled by synthetic data for pretraining, a significant domain gap, due to variations in lighting, color, and viewpoint, limits the effectiveness of the pretrain-finetune paradigm. To overcome this issue, we propose a unified pipeline incorporating domain adaptation at both image and region levels. Our method features two key components: Domain-aware Diffusion (DaD) for image-level adaptation, which aligns image distributions between synthetic and real-world domains, e.g., CUHK-PEDES, and Multi-granularity Relation Alignment (MRA) for region-level adaptation, which aligns visual regions with descriptive sentences, thereby addressing disparities at a finer granularity. This dual-level strategy effectively bridges the domain gap, achieving state-of-the-art performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/MRA.

Paper Structure

This paper contains 17 sections, 5 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Selected images from synthetic data generated by a diffusion model yang2023towards, real-world data, i.e., CUHK-PEDES li2017person, and our proposed Synthetic Domain-Aligned dataset (SDA). A visual gap between the synthetic and real-world data (target domain) remains in illumination, color, viewpoint, etc. In contrast, images from SDA exhibit the target style while maintaining the high fidelity of the source image, characterized by wide variation in pose, appearance, background, etc. (Best viewed when zooming in.)
  • Figure 2: Overview of the proposed Domain-aware Diffusion (DaD) and the Synthetic Domain-Aligned dataset (SDA) construction. First, we obtain DaD by fine-tuning the diffusion model on real-world target-domain image-text pairs and deploy it to perform image-level domain adaptation, followed by data filtering. Second, we construct a synthetic pedestrian image-text pair dataset, SDA, with region annotations using off-the-shelf tools, i.e., an Image Captioner and an Open-set Object Detector.
  • Figure 3: The low-quality medium samples, generated by DaD, are mainly removed by 1) computing the file size and the mean variance of the difference between the 3 channels of every image; 2) OpenPose.
  • Figure 4: More examples of our proposed SDA. One image-text pair usually carries 2-4 region-phrase annotations. (Best viewed when zooming in.)
  • Figure 5: Overview of the proposed Multi-granularity Relation Alignment framework (MRA). MRA first conducts (a) region-phrase encoding and image-text encoding by the shared Vision Encoder ($E_V$) and the shared Text Encoder ($E_T$). The model is constrained with cross-modal alignment at both the (b) region-phrase and (c) image-text levels, where the shared Fusion Encoder ($E_F$) seeks to fuse the vision and text embeddings for the subsequent predictions.
  • ...and 5 more figures
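
The filtering step mentioned in the Figure 3 caption (file size plus the mean variance of the differences between the 3 channels) can be illustrated with a minimal sketch. This is one possible reading of the caption, not the authors' implementation: the pairwise-difference definition, the function names `channel_difference_score` and `keep_sample`, and both thresholds are assumptions made for illustration, and the OpenPose check is omitted. Pixels are passed as plain (R, G, B) tuples so the example needs only the standard library.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def channel_difference_score(pixels: Sequence[Pixel]) -> float:
    """Mean variance of the pairwise channel differences (R-G, G-B, B-R).

    Assumed interpretation of the Figure 3 caption: a near-grayscale or
    washed-out image has almost identical channels, so the score is near 0.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    rg = [r - g for r, g, b in pixels]
    gb = [g - b for r, g, b in pixels]
    br = [b - r for r, g, b in pixels]
    return fmean([pvariance(diff) for diff in (rg, gb, br)])


def keep_sample(
    pixels: Sequence[Pixel],
    file_size_bytes: int,
    min_bytes: int = 1_000,    # hypothetical threshold, not from the paper
    min_score: float = 1.0,    # hypothetical threshold, not from the paper
) -> bool:
    """Keep an image only if it is large enough on disk and colourful enough."""
    if file_size_bytes < min_bytes:
        return False
    return channel_difference_score(pixels) >= min_score


# A flat gray image is rejected; a colourful one of sufficient size is kept.
gray = [(128, 128, 128)] * 16
colour = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (10, 20, 30)]
gray_kept = keep_sample(gray, file_size_bytes=50_000)
colour_kept = keep_sample(colour, file_size_bytes=50_000)
tiny_kept = keep_sample(colour, file_size_bytes=100)
```

The small file size check is a cheap proxy for images with little detail, while the channel score flags images whose colour information has collapsed; in practice both thresholds would need to be tuned on the generated data.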