Automatic Synthetic Data and Fine-grained Adaptive Feature Alignment for Composed Person Retrieval

Delong Liu, Haiwen Li, Zhaohui Hou, Zhicheng Zhao, Fei Su, Yuan Dong

TL;DR

This paper tackles Composed Person Retrieval (CPR), a task that combines a visual query with a textual query to identify individuals in large image collections. To address the scarcity of annotated data, it builds a scalable data synthesis pipeline with multimodal filtering, which yields SynCPR, a fully synthetic dataset of 1.15 million triplets. It also introduces FAFA, a Fine-grained Adaptive Feature Alignment framework that combines fine-grained dynamic alignment (FDA), feature diversity (FD), and masked feature reasoning (MFR) to enable end-to-end CPR. For evaluation, the authors manually annotate the ITCPR test set. Experiments show substantial improvements over image-only, text-only, and cross-modal baselines, including on CIR-derived tasks. The work has implications for surveillance, forensics, and multimedia search, and the code and data are to be released for reproducibility.

Abstract

Person retrieval has attracted rising attention. Existing methods mainly fall into two retrieval modes, namely image-only and text-only. However, they cannot make full use of the available information and struggle to meet diverse application requirements. To address these limitations, we propose a new Composed Person Retrieval (CPR) task, which combines visual and textual queries to identify individuals of interest from large-scale person image databases. Nevertheless, the foremost difficulty of the CPR task is the lack of available annotated datasets. Therefore, we first introduce a scalable automatic data synthesis pipeline, which decomposes complex multimodal data generation into the creation of textual quadruples followed by identity-consistent image synthesis using fine-tuned generative models. Meanwhile, a multimodal filtering method is designed to ensure the resulting SynCPR dataset retains 1.15 million high-quality and fully synthetic triplets. Additionally, to improve the representation of composed person queries, we propose a novel Fine-grained Adaptive Feature Alignment (FAFA) framework built on fine-grained dynamic alignment and masked feature reasoning. Moreover, for objective evaluation, we manually annotate the Image-Text Composed Person Retrieval (ITCPR) test set. Extensive experiments demonstrate the effectiveness of the SynCPR dataset and the superiority of the proposed FAFA framework over state-of-the-art methods. All code and data will be provided at https://github.com/Delong-liu-bupt/Composed_Person_Retrieval.
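
To make the composed query setting concrete, the toy sketch below ranks a small gallery against a query that carries both an image feature and a text feature. It is only an illustration under stated assumptions: the three-dimensional feature vectors are hand-made, the names `ComposedQuery`, `fuse`, and `rank_gallery` are hypothetical, and the fusion rule (sum of normalized features) is a stand-in chosen for simplicity. It is not the FAFA alignment described in the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ComposedQuery:
    """Hypothetical composed query: one image feature plus one text feature."""
    image_feat: Sequence[float]
    text_feat: Sequence[float]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(query: ComposedQuery) -> List[float]:
    """Placeholder fusion (an assumption, not FAFA): add the two unit vectors."""
    img = l2_normalize(query.image_feat)
    txt = l2_normalize(query.text_feat)
    return l2_normalize([a + b for a, b in zip(img, txt)])


def rank_gallery(query: ComposedQuery, gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery ids sorted by descending cosine similarity to the fused query."""
    q = fuse(query)
    scores = {
        gid: sum(a * b for a, b in zip(q, l2_normalize(feat)))
        for gid, feat in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy features: axis 0 ~ "identity A", axis 1 ~ "described outfit", axis 2 ~ "identity B".
gallery = {
    "same_person_new_outfit": (1.0, 1.0, 0.0),
    "same_person_old_outfit": (1.0, 0.0, 0.0),
    "other_person_new_outfit": (0.0, 1.0, 1.0),
}
query = ComposedQuery(image_feat=(1.0, 0.0, 0.0), text_feat=(0.0, 1.0, 0.0))
ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy setup the gallery entry that matches both the identity in the image and the attribute in the text ranks first, ahead of entries that match only one of the two. That is the behavior neither an image-only nor a text-only query can express on its own.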

Paper Structure (35 sections, 6 equations, 13 figures, 3 tables)

This paper contains 35 sections, 6 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Overview of our contributions. (a) Comparison of the proposed composed person retrieval task with several classic person retrieval tasks. (b) Illustration of the proposed automatic high-quality CPR data synthesis pipeline, the proposed training framework FAFA, and the first carefully annotated test set in this domain, ITCPR. (c) Some examples from our fully synthetic SynCPR dataset.
  • Figure 2: Overall framework of our method. (a) The pipeline for synthesizing high-quality triplets, consisting of three key stages: generation of text quadruples, synthesis of person image pairs, and data filtering. (b) The structure of FAFA. The left part illustrates the training process of the model, while the right part highlights the key objectives employed by FAFA.
  • Figure 3: Example pairs of generated person images using different generative models and generation methods under the same text input.
  • Figure 4: Some representative examples from the ITCPR dataset.
  • Figure 5: Sensitivity analysis of FAFA on hyperparameters and analysis of the SynCPR dataset.
  • ...and 8 more figures