CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification

Liping Lu; Zihao Fu; Duanfeng Chu; Wei Wang; Bingrong Xu

TL;DR

This work tackles vehicle re-identification by enriching feature representations with semantic information, without relying on annotated attributes. It introduces CLIP-SENet, which has two semantic components: a Semantic Extraction Module, built on a TinyCLIP image encoder, that produces raw semantic attributes, and an Adaptive Fine-grained Enhancement Module (AFEM) that suppresses noise and emphasizes discriminative attributes. The refined semantic features are then fused with CNN-derived appearance features. The model is trained end-to-end with a combination of cross-entropy and supervised contrastive losses, and it achieves state-of-the-art results on VeRi-776, VehicleID, and VeRi-Wild while keeping the encoder compact through TinyCLIP. Overall, CLIP-SENet reduces the need for textual annotations and improves fine-grained discrimination, which is relevant to intelligent transportation systems.
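The training objective mentioned above, cross-entropy combined with a supervised contrastive loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `temperature` value, and the `contrastive_weight` balancing factor are assumptions made for illustration, and the summary does not say how the two terms are weighted.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of identity logits."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def supervised_contrastive(features, labels, temperature=0.1):
    """Supervised contrastive loss: pull same-identity features together.

    temperature is an assumed value, not one taken from the paper.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    # Log-softmax of each anchor over all *other* samples in the batch.
    sim = sim - sim.max(axis=1, keepdims=True)
    denom = (np.exp(sim) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one same-identity sample
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return -mean_log_prob_pos.mean()

def total_loss(logits, features, labels, contrastive_weight=1.0):
    """Weighted sum of both terms; contrastive_weight is hypothetical."""
    return cross_entropy(logits, labels) + contrastive_weight * supervised_contrastive(features, labels)
```

The contrastive term only has a defined value when the batch contains at least two images of some identity, so a sampler that draws several images per vehicle is the usual companion to this kind of loss.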

Abstract

Vehicle re-identification (Re-ID) is a crucial task in intelligent transportation systems (ITS), aimed at retrieving and matching the same vehicle across different surveillance cameras. Numerous studies have sought to improve vehicle Re-ID through semantic enhancement. However, these methods often rely on additional annotated information to enable models to extract effective semantic features, which limits their applicability. In this work, we propose a CLIP-based Semantic Enhancement Network (CLIP-SENet), an end-to-end framework designed to autonomously extract and refine vehicle semantic attributes, yielding more robust semantic feature representations. Inspired by the zero-shot solutions that large-scale vision-language models offer for downstream tasks, we leverage the cross-modal descriptive capabilities of the CLIP image encoder to first extract general semantic information. Instead of using a text encoder for semantic alignment, we design an adaptive fine-grained enhancement module (AFEM) that adaptively enhances this general semantic information at a fine-grained level to obtain robust semantic feature representations. These features are then fused with common Re-ID appearance features to further sharpen the distinctions between vehicles. Our comprehensive evaluation on three benchmark datasets demonstrates the effectiveness of CLIP-SENet. Our approach achieves new state-of-the-art performance, with 92.9% mAP and 98.7% Rank-1 on the VeRi-776 dataset, 90.4% Rank-1 and 98.7% Rank-5 on the VehicleID dataset, and 89.1% mAP and 97.9% Rank-1 on the more challenging VeRi-Wild dataset.
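For readers unfamiliar with the reported metrics, the following is a minimal sketch of how mAP and Rank-k are commonly computed from a query-by-gallery distance matrix. It is a simplified illustration rather than the evaluation code used in the paper: the `evaluate` function and its arguments are hypothetical names, and it omits the same-camera filtering that Re-ID benchmark protocols typically apply.

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, ranks=(1, 5)):
    """Return (mAP, {k: Rank-k accuracy}) for a query-by-gallery distance matrix."""
    order = np.argsort(dist, axis=1)  # gallery indices, nearest first
    matches = gallery_ids[order] == query_ids[:, None]
    average_precisions, first_hits = [], []
    for row in matches:
        if not row.any():
            continue  # skip queries with no true match in the gallery
        hit_positions = np.flatnonzero(row)  # 0-based ranks of correct matches
        # Precision at each hit = (hits so far) / (1-based rank of that hit).
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
        first_hits.append(hit_positions[0])
    first_hits = np.array(first_hits)
    # Rank-k: fraction of queries whose first correct match is in the top k.
    rank_k = {k: float((first_hits < k).mean()) for k in ranks}
    return float(np.mean(average_precisions)), rank_k

# Toy example: 2 queries, 4 gallery images, identities 0 and 1.
dist = np.array([[0.1, 0.2, 0.9, 0.8],
                 [0.1, 0.5, 0.2, 0.3]])
mean_ap, rank_k = evaluate(dist, np.array([0, 1]), np.array([0, 1, 0, 1]))
```

Rank-k only looks at the first correct match per query, whereas mAP rewards ranking every image of the same vehicle near the top, which is why papers usually report both.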
