PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

Jicheol Park, Dongwon Kim, Boseung Jeong, Suha Kwak

TL;DR

PLOT is a text-based person search framework that learns cross-modal part representations without explicit part supervision. A slot-attention-based Part Discovery Module uses part slots shared across the image and text modalities, and a text-driven dynamic part attention (TDPA) weights each part slot's contribution per query, improving fine-grained alignment beyond global representations. The model builds on CLIP-based backbones and is trained with a combined objective: global NCE, PartNCE, ID losses, and a cross-modal masked language modeling (CMLM) term. It achieves state-of-the-art results on three benchmarks and offers interpretable part-level correspondences. Inference remains efficient, which supports retrieval over large image collections with natural-language queries.
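
To make the contrastive part of the objective concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a text-image similarity matrix. It is a generic formulation, not the paper's exact global NCE or PartNCE definition, which this summary does not give. The function names and the `temperature` value are assumptions for illustration.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of text i and image j; matched pairs
    sit on the diagonal. `temperature` is an assumed hyperparameter.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Text-to-image: each row should peak on its diagonal entry.
    t2i = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Image-to-text: the same computation on the transposed matrix.
    cols = [list(col) for col in zip(*scaled)]
    i2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2i + i2t)

aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest
print(symmetric_nce(aligned) < symmetric_nce(shuffled))
```

Under this sketch, a part-level variant would apply the same loss to similarities computed from part embeddings instead of global embeddings; how PLOT actually does this is not stated here.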

Abstract

Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
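
The part discovery idea can be sketched as follows: the same initial slots attend over image tokens and over text tokens, so slot k yields a comparable part embedding in each modality, and per-query weights then combine the per-part similarities. This is a simplified illustration, not the authors' implementation. It omits the learned projections and the recurrent slot update of standard slot attention, and `part_weights` is a hand-set stand-in for the TDPA output. All names and the weighted-cosine scoring rule are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def slot_attention(tokens, slots, iters=3):
    """Simplified slot attention (no learned projections, no GRU/MLP update).

    tokens: list of N feature vectors (image patches or words).
    slots:  list of K initial slot vectors, shared by both modalities.
    Returns K part embeddings.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    for _ in range(iters):
        # Softmax over slots for each token, so slots compete for tokens.
        attn = [softmax([scale * dot(t, s) for s in slots]) for t in tokens]
        new_slots = []
        for k in range(len(slots)):
            w = [a[k] for a in attn]
            total = sum(w) + 1e-8
            # Each slot becomes the attention-weighted mean of the tokens.
            new_slots.append(
                [sum(wi * t[d] for wi, t in zip(w, tokens)) / total
                 for d in range(dim)]
            )
        slots = new_slots
    return slots

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-8)

def weighted_part_similarity(img_parts, txt_parts, part_weights):
    """Per-part cosine similarities combined with per-query weights.

    part_weights stands in for the TDPA output: K non-negative
    numbers that sum to 1 (an assumption of this sketch).
    """
    return sum(w * cosine(i, t)
               for w, i, t in zip(part_weights, img_parts, txt_parts))

shared_slots = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[2.0, 0.1], [0.1, 2.0], [1.9, 0.0]]
text_tokens = [[1.5, 0.0], [0.0, 1.5]]
img_parts = slot_attention(image_tokens, shared_slots)
txt_parts = slot_attention(text_tokens, shared_slots)
score = weighted_part_similarity(img_parts, txt_parts, [0.7, 0.3])
```

Because both modalities start from the same slots, slot k on the image side and slot k on the text side can be compared directly, which is what makes part-level matching possible without part annotations in this sketch.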

Paper Structure

This paper contains 20 sections, 14 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The overall architecture of PLOT.
  • Figure 2: Illustration of the CMLM.
  • Figure 3: Top-5 retrieval results of our method on the CUHK-PEDES dataset. Images are sorted from left to right according to their ranks below each text query. Green and red boxes indicate true and false matches, respectively.
  • Figure 4: Visualization of each modality's attention map $\bar{A}_{k}$ at the $T$-th iteration of the PSA block, and of the TDPA weights $\mathbf{a}$, on the CUHK-PEDES dataset.
  • Figure A: Visualization of each modality's attention map $\bar{A}_{k}$ at the $T$-th iteration of the PSA block, and of the TDPA weights $\mathbf{a}$, on the CUHK-PEDES dataset.
  • ...and 4 more figures