TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification

Zheng-An Zhu, Hsin-Che Chien, Chen-Kuo Chiang

TL;DR

TCMM addresses two core issues in unsupervised person re-identification: patch noise in ViT features and data inconsistency in memory-bank training. It introduces a ViT Token Constraint to suppress the influence of patch noise, and a Multi-scale Memory Bank with prototype and instance memories to stabilize representations and exploit outliers. The prototype memory stores cluster prototypes and is trained with a prototype contrastive loss, while the instance memory is trained with an anchor contrastive loss that uses hardest-in-class positives and cross-class negatives. Both memories are updated via momentum. Together, these components yield state-of-the-art results on Market-1501 and MSMT17 and improve robustness and diversity, without altering pseudo-label generation or adding a correcting model.

Abstract

This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images through patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. In addition, previous memory-bank-based contrastive methods may lead to data inconsistency because of the limited batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples and limits model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage that patch noise causes to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at https://github.com/andy412510/TCMM.

Paper Structure (17 sections, 9 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Hierarchical conceptual comparison of memories at different feature levels. The instance-level panel shows an instance memory composed of all data, including outliers, with contrastive learning performed on batch data. The prototype-level panel shows a cluster prototype memory computed from data without outliers, with contrastive learning performed on batch data. The third panel shows the TCMM architecture. TCMM first applies the token constraint to ViT features to reduce the interference of patch noise on the model. TCMM uses a prototype memory without outliers to mitigate the negative impact of potential noise on the model and to address feature inconsistency. TCMM also uses an instance memory that includes outliers, so that the value of all samples is taken into account. Through these mechanisms, TCMM achieves state-of-the-art performance without adding a correcting model or changing the pseudo-label generation process and its results.
  • Figure 2: The overall architecture of TCMM. A ViT encoder processes the input data to obtain token features. All token features are passed to a transformer token constraint process that mitigates the negative impact of patch noise. Features of all training data are assigned pseudo labels by a clustering algorithm (DBSCAN is used as an example here). All data are then used, together with the pseudo labels, to build the instance-level and prototype-level memories. The anchor contrastive loss and the prototype contrastive loss are computed from the batch input and the memories. Both memory modules are updated with a momentum update after contrastive learning.
  • Figure 3: The process of the ViT token constraint. The output CLS feature $f_b$ of the encoder is compared with all patch token features by similarity. Drawing on the concept of contrastive learning, $f_b$ is pulled closer to the token with the highest similarity and pushed away from the $R$ tokens with the lowest similarity.
  • Figure 4: The process of prototype contrastive loss with DBSCAN example. The input data $X^M, X$ are fed into the ViT encoder to extract features, which are then input into DBSCAN to obtain pseudo labels and compute prototypes. These prototypes are stored in the prototype level memory and compared with the anchor $f_b$ to calculate the prototype contrastive learning loss function.
  • Figure 5: The process of anchor contrastive loss with DBSCAN example. The input data $X^M, X$ are fed into the ViT encoder to extract features, which are then input into DBSCAN to obtain pseudo labels. This information is stored in the instance level memory and compared with the anchor $f_b$ to calculate the contrastive learning loss function.
  • ...and 4 more figures
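
The captions for Figures 2 and 3 describe two mechanisms that a small example may make more concrete: the token constraint, which pulls the CLS feature toward its most similar patch token and away from the $R$ least similar ones, and the momentum update of a memory entry. The sketch below is only an illustration under stated assumptions and is not taken from the authors' code. The function names, the cosine similarity, the InfoNCE-style form of the loss, the temperature, the momentum value, and the re-normalization step are all assumptions, since the captions do not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def token_constraint_loss(cls_feat, patch_tokens, num_neg, temperature=0.1):
    """Hypothetical token constraint in an InfoNCE-style form (an assumption).

    Positive: the patch token most similar to the CLS feature.
    Negatives: the `num_neg` patch tokens least similar to it (R in Figure 3).
    """
    if not 0 < num_neg < len(patch_tokens):
        raise ValueError("num_neg must be in [1, number of patch tokens - 1]")
    sims = sorted(cosine(cls_feat, tok) for tok in patch_tokens)
    positive = sims[-1]          # highest similarity
    negatives = sims[:num_neg]   # lowest similarities
    logits = [positive / temperature] + [s / temperature for s in negatives]
    # Numerically stable log-sum-exp over the positive and the negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def momentum_update(memory_entry, feature, momentum=0.2):
    """Blend a stored memory vector with a new feature, then L2-normalize.

    The momentum value and the normalization are assumptions for this sketch.
    """
    mixed = [momentum * m + (1.0 - momentum) * f
             for m, f in zip(memory_entry, feature)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


if __name__ == "__main__":
    cls_feature = [1.0, 0.0]
    tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(token_constraint_loss(cls_feature, tokens, num_neg=2, temperature=1.0))
    print(momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.5))
```

In this form the loss shrinks as the CLS feature aligns with its best-matching patch token and moves away from the least similar tokens, which matches the pull and push behavior described in the Figure 3 caption. The exact loss used by TCMM may differ.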