CLIP-based Camera-Agnostic Feature Learning for Intra-camera Person Re-Identification

Xuan Tan; Xun Gong; Yang Xiang

TL;DR

The paper proposes CLIP-based Camera-Agnostic Feature Learning (CCAFL), a framework for ICS ReID built around two modules, Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL), that guide the model to actively learn camera-agnostic pedestrian features.

Abstract

The Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks due to its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) presents challenges. ICS ReID requires independent identity labeling within each camera, without associations across cameras, which limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Two custom modules are designed to guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. Then, we design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning finer-grained pedestrian features within each camera. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model's ability to predict the camera from which a pedestrian image originates, thus enhancing the model's capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. In particular, on the challenging MSMT17 dataset, we achieve 58.9% mAP, surpassing state-of-the-art methods by 7.6%. Code will be available at: https://github.com/Trangle12/CCAFL.
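
To make the idea of mining hard positives and hard negatives within each camera more concrete, here is a minimal sketch in plain Python. It is not the paper's ICDL loss, whose exact form is not given in this summary. It is a generic batch-hard triplet loss, restricted so that every anchor is only compared with samples from its own camera. The function name `intra_camera_hard_triplet_loss`, the Euclidean distance, and the `margin` value are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_camera_hard_triplet_loss(features, ids, cams, margin=0.3):
    """Generic batch-hard triplet loss computed per camera (illustrative only).

    features: list of feature vectors (lists of floats)
    ids:      per-camera identity labels (only meaningful within one camera)
    cams:     camera index of each sample
    margin:   hypothetical margin value, not taken from the paper
    """
    losses = []
    n = len(features)
    for i in range(n):
        # Candidates are limited to the anchor's own camera, because ICS labels
        # carry no identity correspondence across cameras.
        pos = [euclidean(features[i], features[j]) for j in range(n)
               if j != i and cams[j] == cams[i] and ids[j] == ids[i]]
        neg = [euclidean(features[i], features[j]) for j in range(n)
               if cams[j] == cams[i] and ids[j] != ids[i]]
        if not pos or not neg:
            continue  # anchor has no valid triplet inside its camera
        hardest_pos = max(pos)  # farthest same-identity sample
        hardest_neg = min(neg)  # closest different-identity sample
        losses.append(max(0.0, hardest_pos - hardest_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0


features = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
ids = [0, 0, 1, 1]    # label 1 in camera 0 and label 1 in camera 1 are unrelated
cams = [0, 0, 0, 1]
loss = intra_camera_hard_triplet_loss(features, ids, cams)
```

In this toy batch the sample from camera 1 never enters any triplet, which mirrors the ICS setting: identical label values in different cameras do not imply the same person.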

Paper Structure (47 sections, 20 equations, 12 figures, 5 tables)

This paper contains 47 sections, 20 equations, 12 figures, 5 tables.

Introduction
Related Work
Intra-camera Supervised ReID
Unsupervised Person ReID
Vision-language Models
Adversarial Learning
Methodology
Overview
Intra-camera Pre-defined Labels Prompt Learning
Intra-camera Discriminative Learning
Intra-camera Hybrid Memory Banks Initialization
Optimization
Intra-camera Image-Text Alignment
The Loss for Intra-camera Learning
Inter-camera Learning
...and 32 more sections

Figures (12)

Figure 1: Illustration of label settings under different person Re-ID data configurations. The light-blue areas represent the intra-camera and cross-camera feature spaces, with different shapes corresponding to different identities. (a) Conventional fully supervised training data requires unified identity annotation across all cameras. (b) Intra-camera supervised (ICS) training data only requires independent identity annotation within each camera view, utilizing separate class spaces. In ICS ReID data, superscripts of identity labels indicate camera view labels.
Figure 2: The diagram illustrates our proposed approach, which leverages CLIP and prompt learning to generate textual descriptions for person images within each camera. Based on this, we combine the textual information with intra-camera and inter-camera learning, enabling the model to focus better on discriminative features.
Figure 3: The framework of our CCAFL. Left: Through prompt learning paradigms, we generate text descriptions corresponding to the labels of each person's image within a camera. This provides semantic supervision information for subsequent intra-camera and inter-camera learning. Upper: In the intra-camera learning phase, we construct a hybrid memory for each camera, storing both the central features and instance features of pedestrians. By employing an intra-camera discriminative loss, we enhance the discriminability of pedestrian features within the same camera. Lower: In the inter-camera learning phase, we obtain cross-camera association IDs through a cross-camera association step. We then build a memory that stores prototype features of associated pedestrians, aiding the model in learning pedestrian features across different cameras. Additionally, we introduce a global ID classifier and incorporate inter-camera adversarial learning to mitigate the impact of camera discrepancies.
Figure 4: Illustration of $\mathcal{L}_{intra1}$ and $\mathcal{L}_{intra2}$. The same color indicates that all samples originate from the same camera, while different shapes represent different pedestrian IDs within the camera.
Figure 5: Probability distributions of intra-camera samples processed by the global classifier on the Market-1501 dataset. Left: after a certain number of training epochs, samples with true intra-camera labels exhibit a distribution with a single sharp peak, indicating that the classifier effectively distinguishes different individuals across different cameras. Right: after inter-camera adversarial learning is initiated, inter-camera association labels are obtained through an inter-camera association algorithm. Samples with the same pseudo-label across different cameras are treated as positive examples, which raises the global classifier's probabilities for the same person across cameras and results in multiple peaks.
...and 7 more figures
