CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification

Liping Lu, Zihao Fu, Duanfeng Chu, Wei Wang, Bingrong Xu

TL;DR

This work tackles vehicle re-identification by enriching feature representations with semantic information without relying on annotated attributes. It introduces CLIP-SENet, which uses a TinyCLIP image encoder as a Semantic Extraction Module to obtain raw semantic attributes and an Adaptive Fine-grained Enhancement Module to suppress noise and emphasize discriminative attributes, then fuses these with CNN-derived appearance features. The model is trained end-to-end with a combination of cross-entropy and supervised contrastive losses, achieving state-of-the-art results on VeRi-776, VehicleID, and VeRi-Wild while employing a compact, efficient encoder via TinyCLIP. Overall, CLIP-SENet reduces the need for textual annotations, improves fine-grained discrimination, and delivers strong practical impact for intelligent transportation systems.
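As a rough illustration of the training objective mentioned above, the following sketch combines a cross-entropy term with the standard supervised contrastive (SupCon) formulation in plain Python. The function names, the `supcon_weight` balance factor, and the temperature of 0.1 are assumptions made for this example; the summary does not state how the paper weights the two terms or which SupCon variant it uses.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy(logits, label):
    """Cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def supcon_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss over a batch of feature vectors.

    Each anchor is pulled toward samples sharing its label and pushed away
    from all other samples. Anchors without any positive are skipped.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(logits_batch, features, labels, supcon_weight=1.0, temperature=0.1):
    """Mean cross-entropy plus weighted SupCon (the weighting is an assumption)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + supcon_weight * supcon_loss(features, labels, temperature)
```

In this sketch, a batch whose same-identity features already point in the same direction yields a SupCon term close to zero, while a batch whose labels cut across the feature clusters yields a much larger value.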

Abstract

Vehicle re-identification (Re-ID) is a crucial task in intelligent transportation systems (ITS), aimed at retrieving and matching the same vehicle across different surveillance cameras. Numerous studies have explored methods to enhance vehicle Re-ID by focusing on semantic enhancement. However, these methods often rely on additional annotated information to enable models to extract effective semantic features, which brings many limitations. In this work, we propose a CLIP-based Semantic Enhancement Network (CLIP-SENet), an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes, facilitating the generation of more robust semantic feature representations. Inspired by zero-shot solutions for downstream tasks presented by large-scale vision-language models, we leverage the powerful cross-modal descriptive capabilities of the CLIP image encoder to initially extract general semantic information. Instead of using a text encoder for semantic alignment, we design an adaptive fine-grained enhancement module (AFEM) to adaptively enhance this general semantic information at a fine-grained level to obtain robust semantic feature representations. These features are then fused with common Re-ID appearance features to further refine the distinctions between vehicles. Our comprehensive evaluation on three benchmark datasets demonstrates the effectiveness of CLIP-SENet. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
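The data flow described in the abstract (semantic extraction, adaptive enhancement, fusion with appearance features) can be sketched in dependency-free Python as follows. This is only a shape-level illustration under stated assumptions: the class `ClipSENetSketch`, the sigmoid gate standing in for the AFEM, the concatenate-then-project fusion, and the random weights are all hypothetical choices for the example, and the appearance and semantic vectors are passed in directly rather than computed by a CNN backbone or a TinyCLIP encoder.

```python
import math
import random


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class ClipSENetSketch:
    """Hypothetical stand-in for the fusion and enhancement stages only."""

    def __init__(self, app_dim, sem_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Assumed fusion: concatenate both feature vectors, then project.
        self.w_fuse = random_matrix(out_dim, app_dim + sem_dim, rng)
        # Assumed AFEM: one gate in (0, 1) per semantic dimension.
        self.w_gate = random_matrix(sem_dim, sem_dim, rng)
        # Projection so refined semantics match the fused dimension.
        self.w_proj = random_matrix(out_dim, sem_dim, rng)

    def afem(self, semantic):
        """Re-weight raw semantic features; low gates damp noisy attributes."""
        gates = [sigmoid(g) for g in linear(self.w_gate, semantic)]
        return [g * s for g, s in zip(gates, semantic)]

    def forward(self, appearance, semantic):
        """Fuse appearance and raw semantics, then add refined semantics."""
        fused = linear(self.w_fuse, list(appearance) + list(semantic))
        refined = linear(self.w_proj, self.afem(semantic))
        return [f + r for f, r in zip(fused, refined)]


model = ClipSENetSketch(app_dim=4, sem_dim=3, out_dim=5)
embedding = model.forward([0.1, 0.2, 0.3, 0.4], [0.5, -0.2, 0.1])
```

Because each gate lies strictly between 0 and 1, the refined semantic vector never exceeds the raw one in magnitude per dimension, which is one simple way to express "suppress noise, keep discriminative attributes"; the actual AFEM may be structured quite differently.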

Paper Structure

This paper contains 28 sections, 8 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of semantic enhancement methods in Re-ID. Semantic noise refers to features that are not strongly correlated with the semantic attributes of vehicles, such as background roads and irrelevant obstructions. Increasing and decreasing weights abstractly illustrate how the weights of different semantic features are adjusted during model training.
  • Figure 2: The pipeline of the CLIP-SENet framework. The CNN Backbone and the SEM process the input images in parallel. The CNN Backbone extracts appearance features from the images. Concurrently, the images are preprocessed to fit the input format of the ViT (Dosovitskiy et al., 2020), and the SEM extracts raw semantic embeddings from them. These semantic embeddings are fused with the vehicle appearance features in a high-dimensional space to maximize the preservation of semantic information. Meanwhile, the AFEM applies adaptive weighting to the raw semantic features, reducing the weight of noisy attributes while favoring those conducive to identification, resulting in refined semantic attributes. Finally, the refined features are added element-wise to the fused features to enhance the final feature representation.
  • Figure 3: Ablation study on the loss function. "-S" denotes training guided by SupCon loss and "-T" denotes training guided by Triplet loss.
  • Figure 4: t-SNE (van der Maaten and Hinton, 2008) visualization of features extracted by the model. 36 randomly selected images from each of 8 vehicle IDs in the VeRi-776 dataset are shown in different colors.
  • Figure 5: Activation map visualization. (a) Input images, (b) our Baseline, (c) our CLIP-SENet.
  • ...and 1 more figures