SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

Yingjia Xu, Jinlin Wu, Daming Gao, Zhen Chen, Yang Yang, Min Cao, Mang Ye, Zhen Lei

TL;DR

SA-Person addresses the challenge of text-based person retrieval in complex scenes by jointly leveraging target appearance and global scene context. It introduces ScenePerson-13W, a large-scale dataset with rich appearance and scene annotations, and a two-stage framework: appearance-grounded retrieval followed by a training-free scene-aware re-ranking module called SceneRanker. Across ScenePerson-13W and other benchmarks, SA-Person consistently surpasses state-of-the-art methods, demonstrating the value of integrating region-level grounding with holistic scene reasoning. The approach achieves strong retrieval performance with scalable inference by restricting expensive MLLM processing to a small candidate set, enabling practical deployment in real-world, full-scene galleries.

Abstract

Text-based person retrieval aims to identify a target individual from an image gallery using a natural language description. Existing methods primarily focus on appearance-driven cross-modal retrieval, yet face significant challenges due to the visual complexity of scenes and the inherent ambiguity of textual descriptions. Contextual information, such as landmarks and relational cues, offers complementary signals for retrieval but remains underexploited in current approaches. Motivated by this limitation, we propose a novel paradigm: scene-aware text-based person retrieval, which explicitly integrates both individual appearance and global scene context to improve retrieval accuracy. To support this, we first introduce ScenePerson-13W, a large-scale benchmark dataset comprising over 100,000 real-world scenes with rich annotations encompassing both pedestrian attributes and scene context. Based on this dataset, we further present SA-Person, a two-stage retrieval framework. In the first stage, SA-Person performs discriminative appearance grounding by aligning textual descriptions with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking module that refines retrieval results by jointly reasoning over pedestrian appearance and the global scene context. Extensive experiments on ScenePerson-13W and existing benchmarks demonstrate the effectiveness of the proposed SA-Person. Both the dataset and code will be publicly released to facilitate future research.
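
The two-stage design described above can be illustrated with a minimal retrieve-then-rerank sketch. This is not the authors' implementation: the function names (`retrieve_then_rerank`, `scene_scorer`), the use of cosine similarity for the first stage, and the plain-list embeddings are all assumptions made for illustration. The `scene_scorer` callable is a stand-in for an expensive scene-aware scorer such as SceneRanker; the only point shown is that it runs on the top-K candidates rather than on the whole gallery.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    gallery_vecs: Sequence[Sequence[float]],
    scene_scorer: Callable[[int], float],
    k: int = 3,
) -> List[int]:
    """Return gallery indices ordered by a two-stage ranking.

    Stage 1 (assumed): rank the full gallery by cheap appearance similarity.
    Stage 2 (assumed): re-order only the top-k candidates with `scene_scorer`,
    a hypothetical placeholder for an expensive scene-aware model.
    """
    # Stage 1: appearance similarity over every gallery item.
    sims = [cosine(query_vec, g) for g in gallery_vecs]
    order = sorted(range(len(gallery_vecs)), key=lambda i: sims[i], reverse=True)

    # Stage 2: the expensive scorer sees at most k candidates.
    head, tail = order[:k], order[k:]
    reranked = sorted(head, key=scene_scorer, reverse=True)

    # Candidates outside the top-k keep their stage-1 order.
    return reranked + tail
```

Because the second-stage scorer is called at most K times per query, its cost stays bounded by K while only the cheap first stage scales with gallery size, which is the scalability argument made in the TL;DR.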

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of person retrieval challenges and scene-aware insights. (a) Illustration of appearance-based retrieval limitations, where multiple candidates match the text "A girl wearing a green and white striped shirt is smiling" because their appearances are similar. (b) Demonstration of scene-aware reranking, which ranks candidate images by aligning scene context in the image with scene cues in the text, such as "located in a restaurant, near a window with red text". (c) Retrieval performance across three progressive input configurations on the augmented CUHK-SYSU dataset. The three configurations are: (1) Cropped + App (cropped image with appearance-only text), (2) Full + App (full image with appearance-only text), and (3) Full + Context (full image with context-enriched text).
  • Figure 2: Qualitative visualization of retrieval results across three progressive input configurations on the augmented CUHK-SYSU dataset. The blue frame contains the top candidates from the domain-specific model IRRA (ViT-L/14), and the orange frame contains the top candidates from the MLLM (InternVL-8B). The green-bordered image is the ground truth.
  • Figure 3: Overview of the ScenePerson-13W construction pipeline. The construction pipeline involves scene segmentation, pedestrian detection and tracking, completeness filtering, image deduplication, and description generation. For each retained pedestrian, a description is generated based on the full image with a highlighted target, capturing their appearance, spatial location, and relationships with surrounding elements.
  • Figure 4: Sunburst visualization of ScenePerson-13W, showing hierarchical feature distribution across Appearance, Landmarks, and Proximity.
  • Figure 5: Overview of the proposed SA-Person framework. The first stage, appearance-based person retrieval, aligns pedestrian-specific regions with appearance-related descriptions for initial retrieval. The second stage, scene-aware reranking, re-ranks the top-$K$ candidates using scene-aided text and the full-scene image.
  • ...and 3 more figures