LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search

Haiguang Wang, Yu Wu, Mengxia Wu, Cao Min, Min Zhang

TL;DR

The paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, which combines Bidirectional Attention-weighted local alignment (BidirAtt) with a Mask Phrase Modeling (MPM) module. MPM reconstructs masked tokens within noun phrases rather than across the entire text, which keeps the masking strategy unbiased.

Abstract

Text-based person search aims to retrieve images of a particular person from a given textual description. A common solution is to match entire images against entire texts, i.e., global alignment, which struggles to pick out the specific details that distinguish people of similar appearance. As a result, some works have shifted their attention to local alignment. One group matches fine-grained parts using the forward attention weights of the transformer, yet underutilizes the available information. Another conducts local alignment implicitly by reconstructing masked parts from the unmasked context, yet relies on a biased masking strategy. Both shortcomings limit performance improvement. This paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, with Bidirectional Attention-weighted local alignment (BidirAtt) and a Mask Phrase Modeling (MPM) module. BidirAtt goes beyond the typical forward attention by treating the gradient of the transformer as backward attention, so that two-sided information is used for local alignment. MPM focuses on mask reconstruction within the noun phrase rather than the entire text, ensuring an unbiased masking strategy. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the superiority of the LAIP framework over existing methods.
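To make the MPM idea concrete, the following is a minimal sketch of phrase-restricted masking, not the authors' implementation. It assumes the noun-phrase spans are already available from some external chunker; the function name `mask_phrase`, the `mask_prob` parameter, and the `[MASK]` token are illustrative assumptions that do not come from the paper.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the paper


def mask_phrase(tokens, phrase_spans, mask_prob=0.5, rng=None):
    """Mask tokens only inside one randomly chosen noun phrase.

    tokens:       list of word tokens.
    phrase_spans: list of half-open (start, end) index pairs, assumed to be
                  produced by an external noun-phrase chunker.
    Returns (masked_tokens, labels), where labels[i] holds the original
    token at each masked position and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)

    # Ignore empty spans; with no usable phrase, nothing is masked.
    spans = [(s, e) for s, e in phrase_spans if e > s]
    if not spans:
        return masked, labels

    # Pick one phrase, then sample mask positions strictly inside it.
    start, end = rng.choice(spans)
    positions = [i for i in range(start, end) if rng.random() < mask_prob]
    if not positions:
        # Guarantee at least one reconstruction target per sample.
        positions = [rng.randrange(start, end)]

    for i in positions:
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


# Hypothetical usage: spans cover "a woman", "a red jacket", "black pants".
tokens = "a woman wearing a red jacket and black pants".split()
spans = [(0, 2), (3, 6), (7, 9)]
masked, labels = mask_phrase(tokens, spans, rng=random.Random(0))
print(masked)
```

Words outside the chosen phrase (here "wearing" and "and") are never masked, which is the contrast with whole-text masking that the abstract draws. How the paper selects phrases and mask ratios is not stated in this summary, so the uniform choices above are placeholders.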

Paper Structure

This paper contains 24 sections, 19 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: (a): Illustration of the confusing local semantic consistency in text-based person search. (b): Comparison between MLM (left) and MPM (right).
  • Figure 2: Visualization of the activation map of LAIP. The warmer the color tone, the stronger its activation to the input phrase.
  • Figure 3: Illustration of the overall architecture. The left-hand side of the figure is the baseline ALBEF (Li et al., 2021) with an additional fusion triplet loss for global alignment between the image and text. The right-hand side is the proposed BidirAtt+MPM for local matching between the image and the phrases in the text.
  • Figure 4: A simplified computation procedure of the forward attention and backward attention.
  • Figure 5: Visualization of top-10 retrieval results on CUHK-PEDES. The first row in each example presents the retrieval results from the baseline ALBEF (Li et al., 2021), and the second row shows the results from LAIP. Correct and incorrect images are marked by green and red rectangles, respectively.
  • ...and 2 more figures