Image Re-Identification: Where Self-supervision Meets Vision-Language Learning

Bin Wang, Yuying Liang, Lei Cai, Huakun Huang, Huanqiang Zeng

TL;DR

The paper addresses image ReID under challenging cross-camera conditions by leveraging self-supervision with a vision-language pre-trained model (CLIP). It introduces SVLL-ReID, a two-stage framework where language self-supervision in stage 1 improves text-prompt discriminability and vision self-supervision in stage 2 enhances image-feature discrimination. Through extensive experiments on six benchmarks, SVLL-ReID achieves state-of-the-art results and consistently outperforms CLIP-ReID, including notable gains on occluded scenes. The findings suggest that coupling self-supervision with vision-language pretraining can yield more robust ReID representations, with practical impact for surveillance and transportation systems.

Abstract

Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate image ReID. We observe that: 1) incorporating language self-supervision in the first training stage can make the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage can make the image features learned by the image encoder more discriminative. These observations imply that: 1) the text prompt learning in the first stage can benefit from the language self-supervision, and 2) the image feature learning in the second stage can benefit from the vision self-supervision. These benefits jointly drive the performance gain of the proposed SVLL-ReID. By conducting experiments on six image ReID benchmark datasets without any concrete text labels, we find that the proposed SVLL-ReID achieves the best overall performance compared with state-of-the-art methods. Code will be publicly available at https://github.com/BinWangGzhu/SVLL-ReID.

Paper Structure (12 sections, 11 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: t-SNE visualization of the text feature distribution produced by the text encoder of CLIP-ReID (a) and our SVLL-ReID (b) in the first stage. Different colors indicate different IDs.
  • Figure 2: t-SNE visualization of the image feature distribution produced by the image encoder of CLIP-ReID (a) and our SVLL-ReID (b) in the second stage. Different colors indicate different IDs.
  • Figure 3: Comparison between CLIP-ReID (Li et al., 2023) and the proposed SVLL-ReID. (a) CLIP-ReID freezes the text encoder and image encoder in the first stage and optimizes a set of learnable text tokens (i.e., $[S]_1 [S]_2 [S]_3 \ldots [S]_M$) under the vision supervision provided jointly by the ReID images and the image encoder; in the second stage, the text prompts together with the text encoder deliver language supervision to fine-tune the pre-trained image encoder. (b) The proposed SVLL-ReID also freezes the text encoder and image encoder in the first stage, but in addition to the vision supervision it introduces language self-supervision, provided by augmented prompts, to help optimize $[S]_1 [S]_2 [S]_3 \ldots [S]_M$. Similarly, in the second stage, in addition to the language supervision it introduces vision self-supervision, provided by augmented images, to help fine-tune the pre-trained image encoder.
  • Figure 4: Visualization of attention maps. (a) Input images, (b) CLIP-ReID, and (c) SVLL-ReID (Ours).
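
The two-stage scheme described in the Figure 3 caption can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the text above does not specify the loss functions or how prompts and images are augmented, so an InfoNCE-style contrastive loss is swapped in for every term, each batch is assumed to hold exactly one image per ID, and the names `info_nce`, `stage1_loss`, `stage2_loss`, `weight`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match positives[i].

    Assumption: a generic InfoNCE loss stands in for the paper's
    unspecified objectives.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def stage1_loss(image_feats, text_feats, aug_text_feats, weight=1.0):
    """Stage 1: both encoders frozen, only the prompt tokens would be trained.

    Vision supervision (image <-> text) plus a hypothetical language
    self-supervision term between prompts and augmented prompts.
    """
    vision_sup = 0.5 * (info_nce(image_feats, text_feats)
                        + info_nce(text_feats, image_feats))
    language_ssl = info_nce(text_feats, aug_text_feats)
    return vision_sup + weight * language_ssl


def stage2_loss(image_feats, aug_image_feats, text_feats, weight=1.0):
    """Stage 2: the image encoder would be fine-tuned.

    Language supervision (image -> text) plus a hypothetical vision
    self-supervision term between images and augmented images.
    """
    language_sup = info_nce(image_feats, text_feats)
    vision_ssl = info_nce(image_feats, aug_image_feats)
    return language_sup + weight * vision_ssl
```

Each argument is a list of feature vectors in which row `i` belongs to ID `i`. In an actual training loop these would be encoder outputs, and the gradient would flow only into the prompt tokens in stage 1 and into the image encoder in stage 2. The `weight` parameter is an assumed knob for balancing the self-supervision term against the cross-modal term.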