FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World

Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li

TL;DR

This work tackles text-based pedestrian retrieval in open-world settings, where zero-shot generalization and evolving, multi-turn queries cause semantic drift and limited adaptability. It introduces FitPro, a three-module framework integrating Feature Contrastive Decoding (FCD) for structured text-image descriptions, Incremental Semantic Mining (ISM) for dynamic knowledge-graph modeling across views and turns, and Query-aware Hierarchical Retrieval (QHR) for adaptive, graph-guided multi-modal retrieval. Across five datasets and two evaluation protocols, FitPro achieves state-of-the-art zero-shot performance, robustly handles interactive queries, and demonstrates strong cross-scene generalization, aided by structure-aware diffusion and KG-based reasoning. The results indicate practical potential for open-world surveillance systems that require flexible, scalable, and reliable text-based person retrieval.

Abstract

Text-based Pedestrian Retrieval (TPR) aims to retrieve specific target pedestrians from visual scenes according to natural language descriptions. Although existing methods have made progress under constrained settings, interactive retrieval in open-world scenarios still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, alleviating semantic drift in zero-shot scenarios. ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling across multi-turn interactions, improving robustness against viewpoint shifts and fine-grained variations in descriptions. QHR dynamically adjusts the retrieval pipeline according to query type, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment.
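The abstract does not spell out how the contrastive decoding in FCD is computed. As a rough illustration only, the sketch below shows one common generic form of contrastive decoding for a single token step: logits conditioned on the full input are amplified against logits from a weaker or distorted input, and tokens that the conditioned model itself finds implausible are excluded. The function name, the `alpha` and `beta` parameters, and the toy logits are all assumptions made for this example and are not taken from FitPro.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode_step(cond_logits, contrast_logits, alpha=1.0, beta=0.1):
    """Pick the next token index with a generic contrastive decoding rule.

    Hypothetical helper, not the paper's FCD implementation.
    cond_logits:     logits given the full (e.g. prompt + clean image) input.
    contrast_logits: logits given a weaker or distorted input.
    alpha:           strength of the contrast (0 reduces to greedy decoding).
    beta:            plausibility cutoff relative to the most likely token.
    """
    if len(cond_logits) != len(contrast_logits):
        raise ValueError("logit vectors must have the same length")

    # Plausibility constraint: keep only tokens that the conditioned
    # distribution rates at least beta times as likely as its top token.
    probs = softmax(cond_logits)
    cutoff = beta * max(probs)

    best_index, best_score = None, -math.inf
    for i, (lc, ln) in enumerate(zip(cond_logits, contrast_logits)):
        if probs[i] < cutoff:
            continue
        # Amplify what the conditioned input adds over the contrast input.
        score = (1.0 + alpha) * lc - alpha * ln
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy vocabulary of three tokens (values are made up for illustration).
cond = [2.0, 1.8, -5.0]
contrast = [2.0, 0.5, -9.0]
greedy_choice = contrastive_decode_step(cond, contrast, alpha=0.0)
contrastive_choice = contrastive_decode_step(cond, contrast, alpha=1.0)
```

In this toy case, greedy decoding selects token 0, while the contrastive rule prefers token 1 because its score depends much more on the full input than on the contrast input. The plausibility cutoff keeps a token with negligible conditioned probability from winning only because the contrast input rates it even lower.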

Paper Structure

This paper contains 19 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of four TPR paradigms.
  • Figure 2: The proposed FitPro framework for text-based zero-shot interactive person retrieval in open scenarios.
  • Figure 3: The proposed Feature Contrastive Decoding (FCD) module.
  • Figure 4: The proposed Incremental Semantic Mining (ISM) module.
  • Figure 5: Visualization of Different Image Denoising and Lossless Upscaling Strategies in the FCD Module. The restored details mainly enhance pedestrian-related textures (appearance, clothing, and accessories) rather than introducing noise.
  • ...and 1 more figure