SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
Yingjia Xu, Jinlin Wu, Daming Gao, Zhen Chen, Yang Yang, Min Cao, Mang Ye, Zhen Lei
TL;DR
SA-Person addresses the challenge of text-based person retrieval in complex scenes by jointly leveraging target appearance and global scene context. It introduces ScenePerson-13W, a large-scale dataset with rich appearance and scene annotations, and a two-stage framework: appearance-grounded retrieval followed by a training-free scene-aware re-ranking module called SceneRanker. Across ScenePerson-13W and other benchmarks, SA-Person consistently surpasses state-of-the-art methods, demonstrating the value of integrating region-level grounding with holistic scene reasoning. The approach achieves strong retrieval performance with scalable inference by restricting expensive MLLM processing to a small candidate set, enabling practical deployment in real-world, full-scene galleries.
Abstract
Text-based person retrieval aims to identify a target individual from an image gallery using a natural language description. Existing methods primarily focus on appearance-driven cross-modal retrieval, yet face significant challenges due to the visual complexity of scenes and the inherent ambiguity of textual descriptions. Contextual information, such as landmarks and relational cues, offers complementary evidence for retrieval but remains underexploited in current approaches. Motivated by this limitation, we propose a novel paradigm: scene-aware text-based person retrieval, which explicitly integrates both individual appearance and global scene context to improve retrieval accuracy. To support this, we first introduce ScenePerson-13W, a large-scale benchmark dataset comprising over 100,000 real-world scenes with rich annotations covering both pedestrian attributes and scene context. Based on this dataset, we further present SA-Person, a two-stage retrieval framework. In the first stage, SA-Person performs discriminative appearance grounding by aligning textual descriptions with pedestrian-specific regions. In the second stage, it applies SceneRanker, a training-free, scene-aware re-ranking module that refines the retrieval results by jointly reasoning over pedestrian appearance and global scene context. Extensive experiments on ScenePerson-13W and existing benchmarks demonstrate the effectiveness of SA-Person. Both the dataset and code will be publicly released to facilitate future research.
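The two-stage structure described above follows a general retrieve-then-rerank pattern, which can be sketched as below. This is an illustrative sketch only, not the SA-Person implementation: the function `two_stage_retrieve`, the dot-product first stage, the `rerank_score` callable, and the `top_k` parameter are all hypothetical stand-ins. In particular, `rerank_score` merely marks the place where an expensive scene-aware scorer would be called.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a cheap stand-in similarity."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(
    query_vec: Sequence[float],
    gallery_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    top_k: int = 3,
) -> List[int]:
    """Return gallery indices ranked by a cheap first stage, with only
    the top_k candidates reordered by an expensive second-stage scorer.

    All names are hypothetical; the scorers are placeholders.
    """
    # Stage 1: rank the whole gallery with the cheap similarity.
    stage1 = sorted(
        range(len(gallery_vecs)),
        key=lambda i: dot(query_vec, gallery_vecs[i]),
        reverse=True,
    )
    candidates, rest = stage1[:top_k], stage1[top_k:]

    # Stage 2: call the expensive scorer on the candidates only,
    # so its cost depends on top_k rather than on the gallery size.
    reranked = sorted(candidates, key=rerank_score, reverse=True)

    # Items outside the candidate set keep their stage-1 order.
    return reranked + rest


if __name__ == "__main__":
    query = [1.0, 0.0]
    gallery = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
    scene_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}  # made-up values
    print(two_stage_retrieve(query, gallery, scene_scores.__getitem__, top_k=2))
```

Because the second-stage scorer runs only on the `top_k` candidates, its cost stays fixed as the gallery grows, which is the scalability argument the TL;DR makes for restricting MLLM processing to a small candidate set.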
