Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

Jiachen Li, Xiaojin Gong

TL;DR

The paper tackles adapting large vision-language models to object Re-ID without relying on semantic class labels. It introduces a direct CLIP fine-tuning approach using the prototypical contrastive loss $\mathcal{L}_{pcl}$ and a memory bank of per-ID centroids $\mathcal{K}$, eliminating the need for textual prompts. Empirically, a single $\mathcal{L}_{pcl}$ loss is competitive with CLIP-ReID in supervised Re-ID, and combining $\mathcal{L}_{pcl}$ with $\mathcal{L}_{id}$ yields strong gains on MSMT17; the framework also extends to unsupervised Re-ID by leveraging established PCL-based losses with a stabilizing patch-projection-free trick. The results demonstrate that prompt learning is not necessary for CLIP adaptation to Re-ID and highlight the viability of a simpler, centroid-based, prompt-free fine-tuning pathway for both person and vehicle Re-ID with competitive or state-of-the-art performance.

Abstract

This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art performance.
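
To make the idea of a PCL loss over a memory bank of per-ID centroids concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the softmax-over-centroids form, the temperature of 0.05, the momentum update rule, and the function names `pcl_loss` and `update_centroids` are all assumptions chosen for illustration, and the image features stand in for the output of a CLIP image encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pcl_loss(features, labels, centroids, temperature=0.05):
    """Assumed PCL form: cross-entropy over feature-to-centroid similarities.

    features:  (batch, dim) image features
    labels:    (batch,) integer ID of each image
    centroids: (num_ids, dim) memory bank, one centroid per ID
    """
    f = l2_normalize(features)
    k = l2_normalize(centroids)
    logits = f @ k.T / temperature                       # (batch, num_ids)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()


def update_centroids(centroids, features, labels, momentum=0.2):
    """Assumed momentum update of each ID's centroid with new features."""
    f = l2_normalize(features)
    out = centroids.copy()
    for feat, y in zip(f, labels):
        out[y] = l2_normalize(momentum * out[y] + (1.0 - momentum) * feat)
    return out
```

In a training loop under these assumptions, each batch would compute `pcl_loss` against the current memory bank, backpropagate through the image encoder only, and then call `update_centroids` with the detached batch features. No text encoder or prompt appears anywhere in the loop, which is the point the abstract makes.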

Paper Structure

This paper contains 15 sections, 5 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: t-SNE visualization of 7 randomly selected IDs from MSMT17. (a) shows that the text centroids learned in CLIP-ReID stage 1 lie very close to the image centroids, which reveals their implicit equivalence. (b) shows that PCL can also learn a high-quality feature space using image centroids alone. Best viewed in color.
  • Figure 2: The framework of our PCL-CLIP model for supervised Re-ID. Unlike CLIP-ReID, which consists of a prompt learning stage and a fine-tuning stage, our approach directly fine-tunes CLIP with a single prototypical contrastive learning (PCL) loss. In our framework, a memory bank stores the up-to-date visual feature centroid of each ID.
  • Figure 3: Performance of CLIP-ReID, PCL-CLIP2, and PCL-CLIP4 over the course of fine-tuning. The solid line denotes mean average precision (mAP) and the dashed line denotes rank-1 accuracy. Best viewed in color.