Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search

Lei Tan, Weihao Li, Pingyang Dai, Jie Chen, Liujuan Cao, Rongrong Ji

TL;DR

This work tackles text-based person search by identifying two key bottlenecks in cross-modal MLM: masking all words uniformly and relying on potentially noisy text descriptions. The authors propose the Attention-Guided Alignment (AGA) framework, featuring Attention-Guided Mask (AGM) Modeling to selectively mask semantically meaningful words via text-class attention and a Text Enrichment Module (TEM) to enrich descriptions by replacing masked words with MLM-driven predictions. The online-plus-momentum model design, cross-modal encoder, and ITC/ITM/MLM losses enable stable, effective cross-modal learning, and AGM and TEM provide complementary benefits: improved semantic alignment and richer textual descriptions. Empirical results on three TBPS benchmarks show state-of-the-art Rank-1 and mAP scores, with extensive ablations validating the contributions and design choices. Overall, AGM and TEM offer a practical, data-efficient way to enhance TBPS by sharpening cross-modal interaction and mitigating text quality issues.

Abstract

In the realm of Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data. However, recent approaches encounter two principal challenges. Firstly, the widely used random Masked Language Modeling (MLM) treats all words in the text equally during training. As a result, many semantically vacuous words ('with', 'the', etc.) are masked; these contribute little to the interaction in cross-modal MLM and hamper representation alignment. Secondly, manually written descriptions in TBPS datasets are tedious to produce and inevitably contain several inaccuracies. To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGM dynamically masks semantically meaningful words by aggregating the attention weights derived from the text encoding process, so that cross-modal MLM can capture information related to the masked word from the text context and the image and align their representations. Meanwhile, TEM alleviates low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with the MLM's predictions. This not only enriches text descriptions but also prevents overfitting. Extensive experiments across three challenging benchmarks demonstrate the effectiveness of our AGA, achieving new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
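The masking idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `attention_guided_mask`, the mean-over-heads aggregation, the top-k selection rule, and the `mask_ratio` parameter are all assumptions made for illustration, and the attention values are made up. The only point taken from the text is that words are chosen for masking according to aggregated attention weights rather than uniformly at random.

```python
def attention_guided_mask(tokens, cls_attention, mask_ratio=0.3, mask_token="[MASK]"):
    """Mask the tokens that receive the most class-token attention.

    tokens:        list of word strings.
    cls_attention: list of heads; each head is a list with one attention
                   weight per token (class token -> token). Hypothetical input.
    """
    num_heads = len(cls_attention)
    # Aggregate attention per token by averaging over heads (an assumption).
    scores = [
        sum(head[i] for head in cls_attention) / num_heads
        for i in range(len(tokens))
    ]
    # Number of tokens to mask; at least one.
    num_to_mask = max(1, round(mask_ratio * len(tokens)))
    # Rank token positions by aggregated attention, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    positions = sorted(ranked[:num_to_mask])
    chosen = set(positions)
    masked = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    return masked, positions


tokens = ["a", "woman", "with", "red", "heels"]
cls_attention = [
    [0.05, 0.30, 0.05, 0.20, 0.40],  # head 1 (made-up values)
    [0.05, 0.25, 0.05, 0.15, 0.50],  # head 2 (made-up values)
]
masked, positions = attention_guided_mask(tokens, cls_attention, mask_ratio=0.4)
print(masked)  # function words such as 'with' are left unmasked
```

With these toy weights the content words 'woman' and 'heels' are masked while 'a' and 'with' are kept, which is the behaviour the abstract contrasts with random masking.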

Paper Structure

This paper contains 14 sections, 12 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Illustration of the motivation of our AGM. The upper part depicts the traditional random mask strategy which easily masks meaningless words, making it impossible to align the corresponding semantics of text context and image. The lower part is the proposed AGM strategy, which can locate semantically meaningful words, thus facilitating cross-modal alignment.
  • Figure 2: Model architecture of our method. AGA consists of an image encoder, a text encoder, and a cross-modal encoder. AGM selects meaningful words for masking by referring to class attention. TEM replaces the original words based on the logits of the MLM head, thereby enriching the original text description. The momentum model (a slow-moving copy of the online model) is used to guide the online model to learn better representations.
  • Figure 3: Visualization of the cross-attention maps of masked words. Left: the word 'with' is masked. Right: the word 'heels' is masked.
  • Figure 4: Example of Text Enrichment Module (TEM). TEM enhances textual descriptions by replacing original words based on the logit of the MLM head, resulting in richer and more precise descriptions.
  • Figure 5: Visualization results of AGM in the cross-attention layer. By highlighting the words with rich semantics, AGM largely improves the quality of the masked text, thereby better aligning cross-modal semantic representations.
  • ...and 1 more figure
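Two components named in the Figure 2 caption, the TEM replacement step and the momentum model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helper names `enrich_text` and `momentum_update`, the argmax replacement rule, the toy vocabulary and logits, and the exponential-moving-average form of the momentum update with its coefficient `m` are all hypothetical choices. The caption only states that TEM replaces words based on the MLM head's logits and that the momentum model moves slowly relative to the online model.

```python
def enrich_text(tokens, positions, mlm_logits, vocab):
    """Replace tokens at the given positions with the MLM head's top prediction.

    mlm_logits: dict mapping a position to a list of logits over `vocab`.
    Taking the argmax is an assumption; sampling would also be possible.
    """
    enriched = list(tokens)
    for pos in positions:
        logits = mlm_logits[pos]
        best = max(range(len(vocab)), key=lambda j: logits[j])
        enriched[pos] = vocab[best]
    return enriched


def momentum_update(momentum_params, online_params, m=0.995):
    """Move each momentum parameter a small step toward the online one.

    Assumed form: theta_m <- m * theta_m + (1 - m) * theta_online.
    """
    for name, value in online_params.items():
        momentum_params[name] = m * momentum_params[name] + (1.0 - m) * value
    return momentum_params


vocab = ["shoes", "heels", "bag"]            # toy vocabulary
tokens = ["a", "woman", "with", "red", "shoes"]
mlm_logits = {4: [1.0, 2.5, 0.1]}            # made-up logits for position 4
print(enrich_text(tokens, [4], mlm_logits, vocab))

momentum = {"w": 1.0}
online = {"w": 0.0}
print(momentum_update(momentum, online, m=0.5))
```

A coefficient `m` close to 1 makes the momentum parameters change slowly, which is the sense in which the caption calls the momentum model slow-moving; the value 0.5 in the usage lines is chosen only to make the arithmetic easy to follow.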