MARS: Paying more attention to visual attributes for text-based person search

Alex Ergasti; Tomaso Fontanini; Claudio Ferrari; Massimo Bertozzi; Andrea Prati

TL;DR

MARS tackles text-based person search (TBPS) by aligning text and images in a shared latent space while mitigating inter-identity noise and intra-identity variation. It introduces a Masked AutoEncoder-based Visual Reconstruction Loss and an Attribute Loss, along with expanded cross-attention in the cross-modal encoder, to strengthen word- and attribute-level grounding. Results on CUHK-PEDES, ICFG-PEDES, and RSTPReid show consistent mAP gains and competitive rank-based metrics. Combining visual reconstruction guidance with balanced attention across attributes yields more discriminative and robust TBPS models.

Abstract

Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is to retrieve one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise: because text descriptions are inherently vague and imprecise, the same description of visual attributes can be associated with different people. The second is intra-identity variation: nuisances such as pose and illumination can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric with respect to the current state of the art.
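Since the abstract reports gains in mAP alongside rank-based metrics, a minimal sketch of how the two are commonly computed may help show why they can disagree. It assumes each query comes with a 0/1 relevance list over the full ranked gallery; the function names and the toy rankings are illustrative and are not taken from the paper.

```python
def average_precision(relevance):
    """AP for one query: mean of precision@rank over the ranks of correct matches.

    `relevance` is the ranked gallery as 0/1 flags (1 = correct identity).
    Assumes the list covers the whole gallery, so every correct match appears.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(relevance, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """mAP: average of the per-query AP values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def rank_k(ranked_lists, k):
    """Rank-k: fraction of queries with at least one correct match in the top k."""
    return sum(1 for r in ranked_lists if any(r[:k])) / len(ranked_lists)


# Toy rankings (hypothetical): two queries, two correct gallery images each.
query_a = [1, 0, 0, 1, 0]  # correct matches at ranks 1 and 4
query_b = [0, 1, 1, 0, 0]  # correct matches at ranks 2 and 3
results = [query_a, query_b]

print(average_precision(query_a))       # (1/1 + 2/4) / 2
print(average_precision(query_b))       # (1/2 + 2/3) / 2
print(mean_average_precision(results))
print(rank_k(results, 1))
```

Rank-k only checks whether some correct image appears in the top k, whereas AP rewards placing all correct images early, so a model can improve mAP without changing Rank-1.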

Paper Structure (20 sections, 19 equations, 8 figures, 2 tables)

Introduction
Related Works
Proposed Method
The MARS Architecture
Baseline Losses
Relation-Aware Loss
Sensitive-Aware Loss
Contrastive Loss
Attribute Loss
Masked AutoEncoder Loss
Full Objective and Reranking
Experimental Results
Experimental Settings
Metrics
Datasets
...and 5 more sections

Figures (8)

Figure 1: CUHK-PEDES images and caption. On the left, a and b are examples of intra-identity variations, where the visual appearance of the same person (e.g., pose, illumination) varies between images. On the right, c and d are examples of inter-identity variations, where a caption can be matched to two identities that look very similar to each other but only one is correct (green for correct match, red for wrong match).
Figure 2: Overview of the proposed architecture (components with the same color share parameters). First, an input image-text pair $(I,T)$ is fed to the Image Encoder $\mathcal{E}_v$ and the Text Encoder $\mathcal{E}_t$, respectively, and the Contrastive Loss is applied to the resulting embeddings $\mathbf{v}$ and $\mathbf{t}$. Second, the MAE Decoder $\mathcal{D}_{mae}$ is trained to reconstruct the original unmasked patch sequence from a masked one. Finally, the text is fed to the Cross-Modal Encoder $\mathcal{E}_{cross}$ and the visual embeddings $\mathbf{v}$ are injected into its cross-attention layers. The output $\mathbf{f}$ of $\mathcal{E}_{cross}$ is used in three loss functions: (a) the class token $f_{cls}$ is used in the Relation-Aware Loss to learn a matching function between positive and negative image-text pairs; (b) given a masked input text $T_{mask}$, the Sensitive-Aware Loss is used to identify the masked word; and (c) the Attribute Loss is computed over the embeddings corresponding to attribute chunks in the text.
Figure 3: An overview of the Attribute Loss. Using SpaCy, chunks of the sentence containing nouns and their related adjectives are identified. Then, after each token is processed by $\mathcal{E}_{cross}$, the embeddings of each chunk are averaged. For each averaged chunk embedding, the model then predicts whether the image-chunk pair is a match. In the figure, words with the same color (i.e., green, red, orange, and purple) represent the extracted chunks and their corresponding embeddings (each box represents an embedding).
Figure 4: Top 25 most common nouns and adjectives in CUHK-PEDES, computed using SpaCy [honnibal2020spacy].
Figure 5: Comparison between the top 10 predictions of the baseline and of our model. Predicted images are ranked from left (position 1) to right (position 10). Our model outperforms the baseline in several pairs, i.e., a, b, c, and d. In pair c, all of our model's predictions contain a bike, which is not true for the baseline. Furthermore, even though our model does not predict the second position correctly in pair e, it achieves a higher mAP by providing 3 correct matches in the top 10 positions compared to 2 correct matches for the baseline. Lastly, in pair f our model is not able to predict any correct image due to the vagueness of the caption, but it still retrieves images closely related to the text.
...and 3 more figures
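
The chunk-level step described in the Figure 3 caption (average the embeddings of each attribute chunk, then predict whether the image-chunk pair matches) can be sketched as follows. This is only an illustration under stated assumptions: the token embeddings stand in for the output of $\mathcal{E}_{cross}$, the chunk spans stand in for what SpaCy would extract, and the linear head, binary cross-entropy, and equal per-chunk averaging are assumed choices rather than the paper's exact formulation.

```python
import math


def mean_pool(token_embeddings, span):
    """Average the token embeddings in the half-open span [start, end)."""
    start, end = span
    chunk = token_embeddings[start:end]
    dim = len(chunk[0])
    return [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]


def match_probability(chunk_embedding, weights, bias):
    """Hypothetical linear head + sigmoid: P(chunk matches the image)."""
    logit = sum(w * x for w, x in zip(weights, chunk_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def attribute_loss(token_embeddings, spans, label, weights, bias):
    """Binary cross-entropy per chunk, averaged so every chunk counts equally.

    `label` is 1 for a matching image-text pair and 0 otherwise (assumption:
    all chunks of a caption inherit the pair-level label).
    """
    losses = []
    for span in spans:
        p = match_probability(mean_pool(token_embeddings, span), weights, bias)
        losses.append(-math.log(p) if label == 1 else -math.log(1.0 - p))
    return sum(losses) / len(losses)


# Toy inputs (hypothetical): 5 tokens with 2-dimensional embeddings,
# and two attribute chunks such as "red jacket" and "blue jeans".
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 6.0], [7.0, 8.0]]
chunk_spans = [(0, 2), (3, 5)]
head_weights, head_bias = [0.1, -0.1], 0.0

print(mean_pool(tokens, (0, 2)))
print(attribute_loss(tokens, chunk_spans, 1, head_weights, head_bias))
```

Averaging the per-chunk losses, rather than summing over tokens, keeps a long chunk from dominating a short one, which is one simple way to read the caption's goal of considering every attribute.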
