Prompt Decoupling for Text-to-Image Person Re-identification

Weihao Li, Lei Tan, Pingyang Dai, Yan Zhang

TL;DR

This work tackles text-to-image person re-identification (TIReID) by decoupling domain adaptation from task adaptation when applying CLIP. It introduces a prompt-tuning strategy to bridge the domain gap and a two-stage training procedure: the first stage optimizes the prompts with CLIP frozen to align domains, and the second stage freezes the prompts and fine-tunes the encoders for fine-grained cross-modal matching. The method improves consistently over full CLIP fine-tuning on three benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid), with notable gains in Rank-1 accuracy and mAP. The approach offers a practical pathway for transferring large-scale vision-language pre-training to TIReID, highlighting the importance of separating domain and task adaptation and the effectiveness of lightweight prompt-based domain alignment.

Abstract

Text-to-image person re-identification (TIReID) aims to retrieve the target person from an image gallery via a textual description query. Recently, pre-trained vision-language models like CLIP have attracted significant attention and have been widely utilized for this task due to their robust capacity for semantic concept learning and rich multi-modal knowledge. However, recent CLIP-based TIReID methods commonly rely on direct fine-tuning of the entire network to adapt the CLIP model to the TIReID task. Although these methods show competitive performance, they are suboptimal because they require domain adaptation and task adaptation to happen simultaneously. To address this issue, we attempt to decouple these two processes during the training stage. Specifically, we introduce a prompt tuning strategy to enable domain adaptation and propose a two-stage training approach to disentangle domain adaptation from task adaptation. In the first stage, we freeze the two CLIP encoders and focus solely on optimizing the prompts to alleviate the domain gap between the original training data of CLIP and the downstream task. In the second stage, we keep the prompts fixed and fine-tune the CLIP model to prioritize capturing fine-grained information, which is more suitable for the TIReID task. Finally, we evaluate the effectiveness of our method on three widely used datasets. Compared to the directly fine-tuned approach, our method achieves significant improvements.
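
The two-stage schedule described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation and does not use CLIP: a single scalar weight `w` stands in for the encoders, a single scalar offset `p` stands in for the learnable prompts, and a mean squared error on made-up data stands in for the matching objective. All names, data, and hyperparameters below are assumptions chosen for illustration. The only point it shows is the freezing pattern: stage 1 updates the prompt while the encoder stays fixed, and stage 2 updates the encoder while the prompt stays fixed.

```python
# Toy illustration only: scalar "encoder" weight w and scalar "prompt" p
# are hypothetical stand-ins for CLIP's encoders and the learnable prompts.

def loss_and_grads(w, p, data):
    """Mean squared error of w * (x + p) against y, with analytic gradients."""
    n = len(data)
    loss = grad_w = grad_p = 0.0
    for x, y in data:
        err = w * (x + p) - y
        loss += err * err / n
        grad_w += 2.0 * err * (x + p) / n
        grad_p += 2.0 * err * w / n
    return loss, grad_w, grad_p


def train(w, p, data, steps, lr, train_prompt, train_encoder):
    """Gradient descent that updates only the parameter groups flagged trainable."""
    for _ in range(steps):
        _, grad_w, grad_p = loss_and_grads(w, p, data)
        if train_prompt:
            p -= lr * grad_p
        if train_encoder:
            w -= lr * grad_w
    return w, p


# Hypothetical "downstream" data generated by y = 2 * (x + 1).
data = [(x, 2.0 * (x + 1.0)) for x in (-1.0, 0.0, 1.0, 2.0)]
w0, p0 = 1.0, 0.0  # "pre-trained" encoder weight and untrained prompt

# Stage 1: encoder frozen, prompt optimized.
w1, p1 = train(w0, p0, data, steps=200, lr=0.05,
               train_prompt=True, train_encoder=False)

# Stage 2: prompt frozen, encoder fine-tuned.
w2, p2 = train(w1, p1, data, steps=200, lr=0.05,
               train_prompt=False, train_encoder=True)

for name, (w, p) in {"initial": (w0, p0),
                     "after stage 1": (w1, p1),
                     "after stage 2": (w2, p2)}.items():
    print(name, "loss =", loss_and_grads(w, p, data)[0])
```

In a deep-learning framework the same pattern is usually obtained by disabling gradient updates for the frozen parameter group in each stage rather than by explicit flags as in this sketch.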

Paper Structure (20 sections, 6 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Overview of our approach, which adopts a two-stage training strategy. In the first stage (left), we optimize the prompts for domain adaptation while keeping CLIP frozen. In the second stage (right), we freeze the prompts and fine-tune CLIP for task adaptation.
  • Figure 2: Ablation study on prompt length on CUHK-PEDES.