Automatic Synthetic Data and Fine-grained Adaptive Feature Alignment for Composed Person Retrieval
Delong Liu, Haiwen Li, Zhaohui Hou, Zhicheng Zhao, Fei Su, Yuan Dong
TL;DR
This paper tackles Composed Person Retrieval (CPR), which combines visual and textual queries to identify individuals in large image collections. It addresses data scarcity with a scalable automatic data synthesis pipeline and a multimodal filtering process. It introduces FAFA, a Fine-grained Adaptive Feature Alignment framework with fine-grained dynamic alignment (FDA), feature diversity (FD), and masked feature reasoning (MFR) to enable end-to-end CPR. The million-scale synthetic SynCPR dataset (1.15 million triplets) and the manually annotated ITCPR test set are released for training and evaluation, with experiments showing substantial improvements over image-only, text-only, and cross-modal baselines, including on CIR-derived tasks. The work demonstrates the practical value of combining multimodal data synthesis with fine-grained alignment for robust, scalable composed person retrieval, with implications for surveillance, forensics, and multimedia search. Code and data are to be provided for reproducibility.
Abstract
Person retrieval has attracted rising attention. Existing methods fall mainly into two retrieval modes, namely image-only and text-only. However, they cannot make full use of the available information and struggle to meet diverse application requirements. To address these limitations, we propose a new Composed Person Retrieval (CPR) task, which combines visual and textual queries to identify individuals of interest from large-scale person image databases. The foremost difficulty of the CPR task is the lack of available annotated datasets. Therefore, we first introduce a scalable automatic data synthesis pipeline, which decomposes complex multimodal data generation into the creation of textual quadruples followed by identity-consistent image synthesis using fine-tuned generative models. A multimodal filtering method is then designed to ensure that the resulting SynCPR dataset retains 1.15 million high-quality, fully synthetic triplets. Additionally, to improve the representation of composed person queries, we propose a novel Fine-grained Adaptive Feature Alignment (FAFA) framework based on fine-grained dynamic alignment and masked feature reasoning. Moreover, for objective evaluation, we manually annotate the Image-Text Composed Person Retrieval (ITCPR) test set. Extensive experiments demonstrate the effectiveness of the SynCPR dataset and the superiority of the proposed FAFA framework over state-of-the-art methods. All code and data will be provided at https://github.com/Delong-liu-bupt/Composed_Person_Retrieval.
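To make the CPR task setting concrete, the following minimal sketch shows how a composed query (a reference image plus a text description) could be scored against a gallery. It is an illustration only and not the FAFA method: the weighted-average fusion is a deliberately simple stand-in, the embeddings are assumed to come from some unspecified image and text encoders sharing one embedding space, and the names `ComposedQuery`, `fuse`, `rank_gallery`, and `text_weight` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ComposedQuery:
    """Hypothetical container: embeddings of the reference image and the text."""
    image_emb: Sequence[float]
    text_emb: Sequence[float]


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def fuse(query: ComposedQuery, text_weight: float = 0.5) -> List[float]:
    """Assumed fusion: weighted average of the two unit-norm embeddings.

    This is a placeholder for a learned composition module, not FAFA.
    """
    if len(query.image_emb) != len(query.text_emb):
        raise ValueError("image and text embeddings must share a dimension")
    img = l2_normalize(query.image_emb)
    txt = l2_normalize(query.text_emb)
    mixed = [(1.0 - text_weight) * a + text_weight * b for a, b in zip(img, txt)]
    return l2_normalize(mixed)


def rank_gallery(
    query: ComposedQuery,
    gallery: Dict[str, Sequence[float]],
    text_weight: float = 0.5,
) -> List[str]:
    """Return gallery ids sorted by cosine similarity to the fused query."""
    q = fuse(query, text_weight)
    scores = {
        gid: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for gid, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 3-d embeddings: "b" matches both the image cue and the text cue.
    query = ComposedQuery(image_emb=[1.0, 0.0, 0.0], text_emb=[0.0, 1.0, 0.0])
    gallery = {"a": [1.0, 0.0, 0.0], "b": [1.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
    print(rank_gallery(query, gallery))  # ['b', 'a', 'c']
```

In this toy example, setting `text_weight=0.0` reduces the sketch to image-only retrieval and ranks `a` first, which illustrates why a query that uses both modalities can select a different target than either modality alone.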
