CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval

Hang Yu, Jiahao Wen, Zhedong Zheng

TL;DR

This paper tackles the domain gap between synthetic pretraining data and real-world data in text-based person retrieval. It proposes CAMeL, a domain-agnostic pretraining framework that combines stylization tasks, a dynamic error sample memory, cross-modality meta-learning, and an adaptive dual-speed update to learn robust, cross-modal representations. The approach achieves competitive or state-of-the-art recall and mAP on CUHK-PEDES, ICFG-PEDES, and RSTPReid, while maintaining efficiency and demonstrating strong zero-shot and domain-migration robustness. Overall, CAMeL advances scalable, generalizable text-based person retrieval by explicitly addressing synthetic data biases and cross-modal adaptation in pretraining.

Abstract

Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model's generalization capability during pretraining and thereby facilitate subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history of errors encountered across multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.
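
The dual-speed idea in the abstract (fast adaptation to new tasks, slow updates that preserve historical tasks) can be illustrated with a small toy example. The sketch below is not the authors' Algorithm 1: it is a generic Reptile/Lookahead-style interpolation in pure Python, and every name and value in it (`dual_speed_round`, `fast_rate`, `slow_rate`, the quadratic toy tasks) is a hypothetical choice made for illustration only.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def sgd_steps(params: Vector, grad_fn: GradFn, lr: float, steps: int) -> Vector:
    """Task-specific adaptation: a few plain SGD steps on a copy of params."""
    adapted = list(params)
    for _ in range(steps):
        grads = grad_fn(adapted)
        adapted = [p - lr * g for p, g in zip(adapted, grads)]
    return adapted


def interpolate(source: Vector, target: Vector, rate: float) -> Vector:
    """Move `source` toward `target` by a fraction `rate` in [0, 1]."""
    return [s + rate * (t - s) for s, t in zip(source, target)]


def dual_speed_round(
    slow: Vector,
    task_grad_fns: Sequence[GradFn],
    inner_lr: float = 0.1,    # assumed value, not from the paper
    inner_steps: int = 5,     # assumed value, not from the paper
    fast_rate: float = 0.5,   # assumed value, not from the paper
    slow_rate: float = 0.1,   # assumed value, not from the paper
) -> Vector:
    """One round: per-task adaptation, fast update, then slow update."""
    fast = list(slow)
    for grad_fn in task_grad_fns:
        # Task-specific update: adapt a copy of the fast weights to this task.
        adapted = sgd_steps(fast, grad_fn, inner_lr, inner_steps)
        # Fast update: pull the fast weights strongly toward the adapted ones.
        fast = interpolate(fast, adapted, fast_rate)
    # Slow update: move the slow weights only a small step toward the fast ones.
    return interpolate(slow, fast, slow_rate)


def quadratic_task(center: Vector) -> GradFn:
    """Toy task: loss = 0.5 * ||params - center||^2, so grad = params - center."""
    return lambda params: [p - c for p, c in zip(params, center)]


if __name__ == "__main__":
    tasks = [quadratic_task([1.0, 0.0]), quadratic_task([0.0, 1.0])]
    weights = [5.0, -5.0]
    for _ in range(200):
        weights = dual_speed_round(weights, tasks)
    print(weights)  # settles at a compromise between the two task optima
```

In this toy setting the slow weights drift toward a point between the two task optima rather than chasing whichever task was seen last, which is the balancing behavior the abstract describes in words.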

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Domain biases are observed between the real-world dataset, CUHK-PEDES (top) [li2017person1], and the synthetic dataset, MALS (bottom) [yang2023towards11]. The visual domain gap includes facial texture defects, resolution differences, and variations in illumination and color. Text annotations also exhibit bias, with MALS favoring gerunds such as "standing" and "posing," while CUHK-PEDES uses more specific verbs, e.g., "wears."
  • Figure 2: Overview of the proposed domain-agnostic pretraining on the synthetic dataset, i.e., MALS. (1) We first design stylized image tasks involving dynamic illumination, image blurring, and adaptive memory, and apply text augmentation to simulate real-world natural language inputs. The augmented image-text pairs are then fed into the encoders to compute the image-text contrastive loss (ITC) and the image-text matching loss (ITM). (2) Subsequently, guided by the meta-learning strategy, model parameters are optimized through gradient updates directed by the loss function, adapting to diverse task requirements. The red dashed line represents the task-specific updates, reflecting the model's rapid optimization on specific tasks (lines 6-10 in Alg. 1). The gray dashed line represents the fast update, which helps the model quickly adapt to new tasks by adjusting global parameters (line 14 in Alg. 1). The black line represents the slow update, ensuring gradual convergence through global optimization (line 17 in Alg. 1).
  • Figure 3: Qualitative comparison of text-to-image retrieval results between Ours (CAMeL) and the Baseline on the benchmark datasets, with results ordered by similarity from highest to lowest, left to right. Correct matches are highlighted with a green frame, while incorrect matches are marked in red. The green-highlighted text emphasizes the details accurately captured by our approach.
  • Figure 4: Ablation study on the memory capacity in CAMeL. We pre-train with memory capacities of 5%, 10%, 20%, 30%, 50%, and 100%, and then report the fine-tuned performance on the CUHK-PEDES dataset. Each percentage is the memory capacity relative to the number of samples extracted for the dynamic illumination and blurring tasks.
  • Figure 5: An example of person retrieval results for text queries with randomly masked words. The retrieved images are arranged from left to right in descending order from R1 to R5. The results show that increasing the number of deleted words does not affect the precision of our retrieval, confirming the robustness of CAMeL. The top image-text pair represents the original retrieval result. Green boxes indicate correct matches, while red boxes indicate incorrect matches.
  • ...and 3 more figures
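
The memory-capacity ablation in Figure 4 expresses capacity as a percentage of the samples drawn for the stylization tasks. The sketch below shows one way such a bounded error memory could work. It is an assumption-laden illustration, not the authors' implementation: this page does not say what counts as an error or how entries are evicted, so the sketch assumes "keep the highest-loss samples", and `ErrorSampleMemory`, `capacity_ratio`, `update`, and `replay_ids` are hypothetical names.

```python
import heapq
from typing import List, Tuple


class ErrorSampleMemory:
    """Bounded memory keeping the highest-loss samples seen so far.

    Assumption: "error" is measured by per-sample loss, and the lowest-loss
    entry is evicted first. Repeated sample ids are not deduplicated here.
    """

    def __init__(self, task_sample_size: int, capacity_ratio: float) -> None:
        if not 0.0 < capacity_ratio <= 1.0:
            raise ValueError("capacity_ratio must be in (0, 1]")
        # Capacity is a fraction of the per-task sample size (e.g. 0.3 -> 30%).
        self.capacity = max(1, int(task_sample_size * capacity_ratio))
        self._heap: List[Tuple[float, int]] = []  # min-heap of (loss, sample_id)

    def update(self, sample_id: int, loss: float) -> None:
        """Record a sample; evict the easiest stored sample when full."""
        entry = (loss, sample_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)

    def replay_ids(self) -> List[int]:
        """Return stored sample ids, hardest (highest loss) first."""
        return [sample_id for _, sample_id in sorted(self._heap, reverse=True)]


if __name__ == "__main__":
    memory = ErrorSampleMemory(task_sample_size=10, capacity_ratio=0.3)
    losses = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5]
    for sample_id, loss in enumerate(losses):
        memory.update(sample_id, loss)
    print(memory.replay_ids())  # ids of the three hardest samples
```

A min-heap keeps each update at O(log capacity), so the capacity ratio can be varied as in the ablation without changing the bookkeeping cost much.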