Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval

Delong Liu, Haiwen Li, Zhicheng Zhao, Yuan Dong

TL;DR

This work tackles Text-to-Image Person Retrieval (TIPR) by bridging the gap between visual and textual modalities with the Semantic Enhancement Network (SEN). SEN uses dual CLIP-based encoders, a Text-guided Image Restoration (TIR) auxiliary task, a cross-modal triplet (CMT) loss, and pruning-based text augmentation to achieve fine-grained cross-modal alignment and efficient inference. Key contributions include restoring masked image patches guided by text to reveal local correspondences, optimizing hard positives/negatives via CMT, and focusing on essential attributes through text pruning, all within an end-to-end trainable framework. The approach achieves state-of-the-art results on CUHK-PEDES, ICFG-PEDES, and RSTPReid, demonstrating strong cross-modal alignment and practical efficiency for large-scale TIPR tasks.

Abstract

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework that builds fine-grained interactions and alignment between person images and their corresponding texts. Specifically, a visual-textual dual encoder is first constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, to preliminarily align the image and text features. Second, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. Additionally, a cross-modal triplet loss is presented to handle hard samples and further enhance the model's ability to discriminate minor differences. Moreover, a pruning-based text data augmentation approach is proposed to sharpen the focus on essential elements in descriptions, preventing the model from attending excessively to less significant information. Experimental results show that the proposed method outperforms state-of-the-art methods on three popular benchmark datasets. The code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
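
To make the idea of a triplet loss over hard cross-modal samples more concrete, the following is a minimal sketch of a generic batch-hard triplet loss computed in both retrieval directions. It is not the authors' exact CMT formulation: the use of cosine similarity, the margin value of 0.2, the symmetric sum of the two directions, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def batch_hard_direction(anchors, candidates, anchor_ids, candidate_ids, margin):
    """Average hinge loss over anchors using the hardest positive and negative.

    Hardest positive: same-identity candidate with the LOWEST similarity.
    Hardest negative: different-identity candidate with the HIGHEST similarity.
    """
    losses = []
    for anchor, anchor_id in zip(anchors, anchor_ids):
        pos = [cosine(anchor, c)
               for c, c_id in zip(candidates, candidate_ids) if c_id == anchor_id]
        neg = [cosine(anchor, c)
               for c, c_id in zip(candidates, candidate_ids) if c_id != anchor_id]
        if not pos or not neg:
            continue  # no valid triplet for this anchor in the batch
        losses.append(max(0.0, margin + max(neg) - min(pos)))
    return sum(losses) / len(losses) if losses else 0.0


def cross_modal_triplet_loss(text_feats, image_feats, ids, margin=0.2):
    """Assumed symmetric form: text-to-image plus image-to-text terms."""
    t2i = batch_hard_direction(text_feats, image_feats, ids, ids, margin)
    i2t = batch_hard_direction(image_feats, text_feats, ids, ids, margin)
    return t2i + i2t


if __name__ == "__main__":
    texts = [[1.0, 0.0], [0.0, 1.0]]
    images = [[0.9, 0.1], [0.2, 0.8]]
    print(cross_modal_triplet_loss(texts, images, ids=[0, 1]))
```

In this sketch the loss is zero once every anchor is closer to its hardest same-identity sample than to its hardest different-identity sample by at least the margin, so the gradient concentrates on the pairs that are still confusable.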

Paper Structure (30 sections, 7 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Illustration of the development of text-to-image person retrieval. The green dashed box (A) represents the early global-matching methods, in which global features are extracted separately from texts and images and then matched and aligned. The brown dashed box (B) represents subsequent methods that focus on local matching. These methods explicitly extract local features from both images and texts and perform both global and local feature matching and alignment. The blue dashed box (C) indicates more recent approaches that, in addition to feature alignment, leverage joint image-text auxiliary tasks to aid implicit feature alignment.
  • Figure 2: Overview of the proposed SEN framework, which consists of two feature encoders and one cross-modal interaction decoder. (a) SEN takes two types of image input (complete images and images with randomly masked patches) and a text input produced by probability-based pruning. The network employs three representation learning branches and four loss functions: the ID loss ($\mathcal{L}_{id}$), the SDM loss ($\mathcal{L}_{sdm}$), the CMT loss ($\mathcal{L}_{cmt}$), and the MSE loss ($\mathcal{L}_{tir}$) in the TIR module. (b) The CMT loss is computed by selecting the most challenging sample pairs. (c) The TIR module uses a lightweight decoder that incorporates cross-attention mechanisms for efficient text-guided image restoration.
  • Figure 3: Illustration of data augmentation methods for the visual and textual modalities. The blue box on the left represents the conversion of image input to grayscale during the execution of the TIR auxiliary task. The grayscale image is then randomly masked and fed into the network, which is still required to recover the color image. The brown box on the right represents the deletion of non-key parts in sentences with a certain probability for text input.
  • Figure 4: Ablation results for the TIR module. (a) shows the Rank-1 and mAP metrics of SEN for different depths of the cross-modal interaction decoder. (b) shows the Rank-1 and mAP metrics of SEN for varying image masking ratios when the TIR module is used.
  • Figure 5: Comparison of the top-10 retrieval results between the baseline (first row), IRRA [jiang2023cross] (second row), and SEN (third row) on CUHK-PEDES. Matching and non-matching images corresponding to the text description are highlighted with green and red bounding boxes, respectively. The matched textual entities and local image regions are indicated with phrases and bounding boxes of the same color in the figure.
  • ...and 3 more figures
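
The input pipeline described in the captions of Figures 2 and 3 (probabilistic text pruning, grayscale conversion, random patch masking, and an MSE restoration loss) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The `NON_KEY_WORDS` set is a hypothetical stand-in, since the criterion the paper uses to identify non-key parts of a sentence is not given here. Averaging the MSE over masked patches only is also an assumption, and all function names are invented for illustration.

```python
import random

# Hypothetical stand-in for "non-key" words; the paper's actual pruning
# criterion is not specified in this section.
NON_KEY_WORDS = {"a", "an", "the", "is", "are", "and", "with", "of", "in", "on"}


def prune_text(caption, drop_prob, rng):
    """Delete each non-key word with probability drop_prob; keep key words."""
    kept = [word for word in caption.split()
            if word.lower() not in NON_KEY_WORDS or rng.random() >= drop_prob]
    return " ".join(kept)


def to_grayscale(pixel):
    """Convert one (r, g, b) pixel to luma using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean list in which True marks a randomly masked patch."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


def restoration_mse(pred_patches, target_patches, mask):
    """MSE between predicted and target color patches, masked patches only.

    Restricting the average to masked patches is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for pred, target, is_masked in zip(pred_patches, target_patches, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target))
            count += len(pred)
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    print(prune_text("The woman is wearing a blue coat with a hood", 0.5, rng))
    print(random_patch_mask(num_patches=16, mask_ratio=0.25, rng=rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing masking ratios as in the Figure 4 ablation.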