A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li, Xueping Wang, Yaonan Wang

TL;DR

This work tackles privacy concerns in person re-identification by replacing real imagery with a privacy-preserving synthetic dataset generated via a text-driven diffusion model. It introduces DPPP, a dual-stage pipeline: Stage 1 uses rich prompts to synthesize the GenePerson dataset with diverse appearances and scenes, and Stage 2 employs a Prompt-driven Disentanglement Mechanism (PDM) that learns style and content pseudo-words to extract domain-invariant content features through CLIP-based contrastive learning. The approach yields state-of-the-art cross-domain generalization on Market-1501 and DukeMTMC-reID, outperforming models trained on real datasets and on other synthetic datasets; the best results come from training on GenePerson with PDM. By enabling end-to-end virtual data generation and disentangled, content-focused representation learning, the method reduces privacy risks while preserving strong Re-ID performance and offers a scalable path for privacy-safe cross-domain recognition.

Abstract

With growing concerns over data privacy, researchers have started using virtual data as an alternative to sensitive real-world images for training person re-identification (Re-ID) models. However, existing virtual datasets produced by game engines still face challenges such as complex construction and poor domain generalization, making them difficult to apply in real scenarios. To address these challenges, we propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, we generate rich prompts incorporating multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint that drive the diffusion model to synthesize diverse data end-to-end, building a large-scale virtual dataset named GenePerson with 130,519 images of 6,641 identities. In the second stage, we propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features. With the aid of contrastive learning, we employ two textual inversion networks to map images into pseudo-words representing style and content, respectively, thereby constructing style-disentangled content prompts to guide the model in learning domain-invariant content features at the image level. Experiments demonstrate that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those on popular real and virtual Re-ID datasets.
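
The first stage, as described in the abstract, builds prompts from several attribute dimensions (appearance, illumination, viewpoint). The sketch below illustrates one plausible way to compose such prompts: it keeps one identity's appearance fixed and varies the capture conditions. The attribute lists, the prompt template, and the function name `build_prompts` are illustrative assumptions, not the vocabulary or template used by the authors, and the diffusion model call itself is omitted.

```python
import itertools

# Hypothetical attribute pools (assumptions for illustration only).
ILLUMINATIONS = ["bright daylight", "overcast light", "dim evening light"]
VIEWPOINTS = ["front view", "side view", "back view"]
SCENES = ["on a city street", "in a campus square"]


def build_prompts(appearance, illuminations, viewpoints, scenes):
    """Return one text prompt per (illumination, viewpoint, scene) combination.

    The appearance string is held fixed so that every prompt describes the
    same virtual identity under different capture conditions.
    """
    prompts = []
    for light, view, scene in itertools.product(illuminations, viewpoints, scenes):
        prompts.append(
            f"a full-body photo of {appearance}, {view}, {scene}, {light}"
        )
    return prompts


# One virtual identity described only by text.
identity = "a man wearing a red jacket, blue jeans and a black backpack"
prompts = build_prompts(identity, ILLUMINATIONS, VIEWPOINTS, SCENES)

for p in prompts[:3]:
    print(p)
```

Each prompt would then be passed to a text-to-image diffusion model; enumerating the Cartesian product of conditions is only one way to obtain diversity, and random sampling from the pools would work equally well for larger vocabularies.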

Paper Structure

This paper contains 15 sections, 16 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Motivation of our method. (a) Our method generates diverse virtual samples with only text, greatly simplifying the dataset construction process. (b) PDM learns domain-invariant visual content features under the guidance of prompts in the joint vision-language space.
  • Figure 2: Overview of our framework, which consists of two parts. (a) The prompt-driven virtual image generation pipeline shows the generation of our dataset GenePerson. (b) The disentanglement stage first uses text to effectively capture the style and content information of the image, and then utilizes style-disentangled prompts as a guide for disentangling the visual representations.
  • Figure 3: Pedestrians with different postures in the GenePerson dataset. Virtual samples are generated to match the given poses.
  • Figure 4: Some of the image samples in the proposed GenePerson dataset, including 3 different pedestrians in different background scenes, and pedestrians in 9 different illumination conditions.
  • Figure 5: At inference time, the trained visual encoder and de-stylization projector are used to extract image content features.
  • ...and 1 more figure
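
The inference step in the Figure 5 caption (visual encoder followed by a de-stylization projector) can be sketched as a standard Re-ID retrieval loop: extract a content feature per image, then rank gallery images by cosine similarity to the query. The `toy_encoder` and `toy_projector` below are hypothetical stand-ins, not the trained networks from the paper; the toy projector simply drops the last vector component, which is treated here as "style" purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def extract_content_feature(image, encoder, projector):
    """Encode the image, remove style with the projector, then normalize."""
    return l2_normalize(projector(encoder(image)))


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = [sum(q * g for q, g in zip(query_feat, feat)) for feat in gallery_feats]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical stand-ins for the trained networks (assumptions).
def toy_encoder(image):
    return list(image)          # pretend the "image" is already a feature


def toy_projector(feature):
    return feature[:-1]         # pretend the last component encodes style


query = [1.0, 0.0, 5.0]
gallery = [[0.0, 1.0, 5.0],     # same "style", different content
           [2.0, 0.0, -3.0]]    # different "style", same content

q_feat = extract_content_feature(query, toy_encoder, toy_projector)
g_feats = [extract_content_feature(g, toy_encoder, toy_projector) for g in gallery]
ranking = rank_gallery(q_feat, g_feats)
print(ranking)
```

In this toy setup the second gallery item ranks first because its content components match the query once the style component is discarded, which is the behaviour a style-disentangled feature is meant to provide.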