PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang

TL;DR

PSDiff reframes person search as a dual denoising task within a diffusion framework, removing reliance on detector-generated proposals and enabling iterative, collaborative refinement between detection and Re-ID. A Collaborative Denoising Layer mediates interaction between boxes and embeddings, guiding a cascaded, end-to-end learning process. Empirical results on CUHK-SYSU and PRW show state-of-the-art mAP and top-1 performance with significantly fewer parameters and flexible computation via fast diffusion sampling. The work demonstrates the value of cross-task collaboration and diffusion-based conditioning for complex, multi-task vision problems, while acknowledging potential biases from learned supervision used for ground-truth embeddings. Overall, PSDiff offers a practical, scalable approach to high-precision person search with strong transferability across datasets.

Abstract

Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., pedestrian detection and Re-IDentification (ReID). Despite significant progress, current methods face two primary challenges: 1) the pedestrian candidates learned within detectors are suboptimal for the ReID task; and 2) the potential for collaboration between the two sub-tasks is overlooked. To address these issues, we present PSDiff, a novel Person Search framework based on the diffusion model. PSDiff formulates person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Distinct from the conventional Detection-to-ReID approach, our denoising paradigm discards the prior pedestrian candidates generated by detectors, thereby avoiding the local optimum problem of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize the detection and ReID sub-tasks in an iterative and collaborative way, which makes the two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.

Paper Structure

This paper contains 41 sections, 14 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) The baseline with prior pedestrian candidates ignores some crucial parts such as carry-on items, causing low ReID matching scores. (b) The baseline without collaboration misses results with high ReID matching scores due to inaccurate positions and low confidence scores. In comparison, our method can increase discriminative parts and refine two tasks collaboratively, producing more discriminative embeddings and more accurate detection results. "D" and "R" index confidence scores of detection and ReID matching scores with queries, respectively. The bounding boxes in green and red denote the correct and wrong results.
  • Figure 2: The overall architecture of the proposed PSDiff. During the training stage, $b_t, e_t$ are noisy boxes and embeddings calculated by Eq. \ref{equ:noisybe}. During the inference stage, $b_t, e_t$ are boxes and ReID embeddings randomly sampled from Gaussian noise, which are refined gradually through iterative inference with fast sampling methods, e.g., DDIM.
  • Figure 3: Illustrations of the collaborative interaction.
  • Figure 4: Comparison to different methods on CUHK-SYSU under varying gallery sizes.
  • Figure 5: Qualitative search results on the CUHK-SYSU dataset. The bounding boxes in yellow denote the queries, while green and red denote the correct and wrong results. "D" and "R" index confidence scores of detection and ReID matching scores with queries, respectively. "Rank-n" refers to the rank of the presented results among all predictions of the gallery.
  • ...and 1 more figure
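
The Figure 2 caption distinguishes two sources for $b_t, e_t$: noised ground truths during training and pure Gaussian noise at inference. The sketch below illustrates that distinction under stated assumptions. It uses the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with a cosine schedule, which may differ from the paper's own noising equation. The helper names (`cosine_alpha_bar`, `add_noise`), the box format, the embedding size of 256, and the random placeholder ground truths are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar_t for t = 0..num_steps.

    Cosine schedule; assumed here, not confirmed by the source.
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def add_noise(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
              rng: np.random.Generator) -> np.ndarray:
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_t) * x_0, (1 - a_t) * I)."""
    a_t = alpha_bar[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(num_steps=1000)

# Placeholder ground truths (hypothetical): 3 boxes as (cx, cy, w, h) in [0, 1]
# and 3 unit-norm 256-d ReID embeddings.
b0 = rng.uniform(0.1, 0.9, size=(3, 4))
e0 = rng.standard_normal((3, 256))
e0 /= np.linalg.norm(e0, axis=1, keepdims=True)

# Training: both targets are noised at the same timestep t.
t = 500
b_t = add_noise(b0, t, alpha_bar, rng)
e_t = add_noise(e0, t, alpha_bar, rng)

# Inference: no ground truth is available, so both start as Gaussian noise
# and would then be refined step by step by the denoising network.
b_T = rng.standard_normal((3, 4))
e_T = rng.standard_normal((3, 256))
```

Noising boxes and embeddings with the same timestep keeps the two inputs at a matched noise level, which is one plausible reading of the "dual denoising" formulation; the denoising network itself and the DDIM update are outside the scope of this sketch.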