DualFocus: Integrating Plausible Descriptions in Text-based Person Re-identification

Yuchuan Deng, Zhanpeng Hu, Jiakun Han, Chuang Deng, Qijun Zhao

TL;DR

DualFocus is a unified framework that integrates plausible descriptions to enhance the interpretative accuracy of vision-language models in Text-based Person Re-identification (TPR) tasks. It also proposes the Dynamic Tokenwise Similarity (DTS) loss, which refines the representation of both matching and non-matching descriptions, improving the matching process through detailed and adaptable similarity assessments.

Abstract

Text-based Person Re-identification (TPR) aims to retrieve specific individual images from datasets based on textual descriptions. Existing TPR methods primarily focus on recognizing explicit and positive characteristics, often overlooking the role of negative descriptions. This oversight can lead to false positives: images that meet positive criteria but should be excluded based on negative descriptions. To address these limitations, we introduce DualFocus, a unified framework that integrates plausible descriptions to enhance the interpretative accuracy of vision-language models in TPR tasks. DualFocus leverages Dual (Positive/Negative) Attribute Prompt Learning (DAPL), which incorporates Dual Image-Attribute Contrastive (DIAC) Learning and Sensitive Image-Attributes Matching (SIAM) Learning, enabling the detection of non-existent attributes and reducing false positives. To achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss, which refines the representation of both matching and non-matching descriptions, thereby improving the matching process through detailed and adaptable similarity assessments. In comprehensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid, DualFocus demonstrates superior performance over state-of-the-art methods, significantly enhancing both precision and robustness in TPR.
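
The distinction between coarse and fine-grained alignment mentioned in the abstract can be illustrated with a small sketch. This is not the paper's DTS loss, whose formulation is not given here; it is a generic comparison, under assumed toy embeddings, between a global similarity (cosine of mean-pooled tokens) and a token-wise similarity (each text token matched to its best image token). The function names `coarse_similarity` and `tokenwise_similarity` are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def mean_pool(tokens):
    """Average a list of token embeddings into a single global embedding."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def coarse_similarity(image_tokens, text_tokens):
    """Coarse alignment: cosine between the two mean-pooled embeddings."""
    return cosine(mean_pool(image_tokens), mean_pool(text_tokens))

def tokenwise_similarity(image_tokens, text_tokens):
    """Fine-grained alignment: for every text token take its best-matching
    image token, then average those best scores over the text tokens."""
    best = [max(cosine(t, v) for v in image_tokens) for t in text_tokens]
    return sum(best) / len(best)

# Toy example (assumed values): two image tokens, one text token that
# matches the first image token exactly.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0]]
coarse = coarse_similarity(image_tokens, text_tokens)
fine = tokenwise_similarity(image_tokens, text_tokens)
```

In this toy case the token-wise score is higher than the coarse score, because mean pooling mixes the matching image token with an unrelated one. A loss that balances the two views would presumably weight such scores against each other, but how DTS does so is specified in the paper itself, not in this sketch.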

Paper Structure (33 sections, 12 equations, 2 figures, 7 tables)


Figures (2)

  • Figure 1: Illustration of the effect of negative descriptions. The green box marks a successful retrieval, whereas the red box marks a failed retrieval. Negative descriptions offer additional information that refines the search parameters, enabling a more accurate identification of an individual.
  • Figure 2: Overview of the proposed DualFocus framework. It consists of six encoders: one Image Encoder ($E_I$), three Text Encoders ($E_T$), and two Cross Encoders ($E_C$). The Image Encoder generates embeddings from visual inputs, while the Text Encoders produce embeddings from textual data. The Cross Encoders integrate these embeddings to enhance cross-modal prediction tasks. Text in grey boxes indicates tasks related to learning processes. Training strategies include Dual Image-Attribute Contrastive Learning (DIAC), which distinguishes images based on attributes; Sensitive Image-Attributes Matching Learning (SIAM), which aligns attributes in text with images; Dynamic Tokenwise Similarity Loss (DTS), which adjusts token similarity measures for accuracy; Masked Positive Attribute Language Modeling (MPAM), which predicts masked positive attributes in context; Masked Language Modeling (MLM), which improves understanding by predicting missing words in sentences; and Identity loss (ID), which ensures the preservation and recognition of individual characteristics across different modalities.