Leveraging Prior Knowledge of Diffusion Model for Person Search

Giyeol Kim, Sooyoung Yang, Jihyong Oh, Myungjoo Kang, Chanho Eom

TL;DR

DiffPS addresses the backbone conflict between detection and re-ID in person search by leveraging a frozen pre-trained diffusion model as a rich, task-agnostic prior. It introduces three specialized modules—DGRPN for diffusion-guided proposals, MSFRN for high-frequency refinement, and SFAN for text-aligned semantic aggregation—that extract and fuse diffusion priors without updating the backbone. The method achieves state-of-the-art results on CUHK-SYSU and PRW, including strong performance on occluded and small-scale instances, by mitigating shape bias and enhancing fine-grained details. This diffusion-prior framework offers a practical approach to improving generalization and robustness in person search with decoupled optimization and plug-and-play components.

Abstract

Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.

Paper Structure

This paper contains 45 sections, 6 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Detailed architecture of the UNet in diffusion models (Stable Diffusion), comprising down-stage, mid-stage, and up-stage for hierarchical feature processing. (Best viewed in color.)
  • Figure 2: (a) Cross-attention maps highlighting different semantic regions based on textual queries. (b) PCA visualization of features extracted from the ViT block of $\mathbf{F}_u^3$ at different timesteps ($t$). (c) PCA visualization of up-stage feature maps ($\mathbf{F}_u^1, \mathbf{F}_u^2, \mathbf{F}_u^3$), with columns representing different up-stage levels and rows corresponding to features from ViT blocks within each level.
  • Figure 3: (a) Overview of DiffPS framework. DiffPS leverages a pre-trained diffusion model's UNet as the backbone, with three specialized modules: (b) DGRPN refines region proposals using cross-attention maps, (c) MSFRN enhances high-frequency details via multi-scale frequency refinement, and (d) SFAN incorporates text-aligned semantic features for re-ID. (Best viewed in color.)
  • Figure 4: Timestep analysis of diffusion features on PRW. (a) Detection performance (AP) using three different layers from up-stage level 3 (blue, orange, green). (b) Re-ID performance (mAP) using features corresponding to (g), (b), and (c) from Table 3 (blue, orange, green). (Best viewed in color.)
  • Figure 5: Qualitative results of MSFRN and SFAN. (a) The first row shows features before MSFRN, while the second row presents the refined outputs, with the right column visualizing high-frequency components via DWT. (b) The first column displays original person crops, and the second column presents PCA visualizations of features before they are processed by SFAN. The third column shows the aggregated semantic maps $\sum_{c} \hat{\mathbf{S}}_{c}$, and the last column presents $\mathbf{F}_{\text{sem}}$, the final semantic features.
  • ...and 4 more figures
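
The Figure 5 caption refers to visualizing high-frequency components via a discrete wavelet transform (DWT). As a rough illustration of what such a decomposition separates, the sketch below applies a single-level 2D Haar DWT to a small 2D map and sums the energy in the three detail subbands. It is a minimal sketch under stated assumptions, not the authors' MSFRN: the page does not say which wavelet or how many levels the paper uses, so the Haar wavelet, the single level, and the function names `haar_dwt2` and `high_freq_energy` are all hypothetical choices made here.

```python
def haar_dwt2(x):
    """Single-level 2D orthonormal Haar DWT of a 2D list with even height and width.

    Assumption: Haar is used only for illustration; the source page does not
    name the wavelet. Returns (approx, d_cols, d_rows, d_diag), each half the
    size of x in both dimensions.
    """
    rows, cols = len(x), len(x[0])
    if rows % 2 or cols % 2:
        raise ValueError("both dimensions must be even")
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for i in range(0, rows, 2):
        row_a, row_c, row_r, row_d = [], [], [], []
        for j in range(0, cols, 2):
            # One 2x2 block: a b / c d
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_a.append((a + b + c + d) / 2)  # low-pass in both directions
            row_c.append((a - b + c - d) / 2)  # difference across columns
            row_r.append((a + b - c - d) / 2)  # difference across rows
            row_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(row_a)
        d_cols.append(row_c)
        d_rows.append(row_r)
        d_diag.append(row_d)
    return approx, d_cols, d_rows, d_diag


def high_freq_energy(x):
    """Sum of squared coefficients over the three detail (high-frequency) subbands."""
    _, d_cols, d_rows, d_diag = haar_dwt2(x)
    return sum(v * v for band in (d_cols, d_rows, d_diag) for row in band for v in row)
```

A constant patch such as `[[1.0, 1.0], [1.0, 1.0]]` has zero detail energy, while a patch with a sharp left-to-right step such as `[[1.0, 0.0], [1.0, 0.0]]` puts half of its energy in the column-difference subband. This is the sense in which detail subbands capture edges and fine texture rather than smooth shape.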