TSCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition

Guoyang Zhao, Fulong Ma, Weiqing Qi, Chenguang Zhang, Yuxuan Liu, Ming Liu, Jun Ma

TL;DR

This work tackles robust traffic sign recognition under global distribution shifts by introducing TSCLIP, a prompt-engineered CLIP fine-tuning framework augmented with adaptive dynamic weight ensembling. By constructing the CRTS dataset from ten regions and designing traffic-sign–specific prompts, TSCLIP maintains zero-shot generalization while learning region-specific cues. The adaptive factor mechanism dynamically balances zero-shot and task-specific knowledge during training, achieving state-of-the-art cross-regional performance and demonstrating strong generalization to unseen regional signs. The approach offers practical impact for worldwide autonomous driving and guided navigation, enabling reliable recognition across diverse traffic sign systems.

Abstract

Traffic signs are critical map features for navigation and traffic control. However, current traffic sign recognition methods rely on traditional deep learning models, which typically suffer significant performance degradation when the data distribution varies across regions. In this paper, we propose TSCLIP, a robust fine-tuning approach built on the contrastive language-image pre-training (CLIP) model for worldwide cross-regional traffic sign recognition. We first curate a cross-regional traffic sign benchmark dataset by combining data from ten different sources. We then propose a prompt engineering scheme tailored to the characteristics of traffic signs, which uses specific scene descriptions and corresponding rules to generate targeted text descriptions. During TSCLIP fine-tuning, we apply adaptive dynamic weight ensembling (ADWE) to combine the outcome of each training iteration with the zero-shot CLIP model. This approach allows the model to retain its ability to generalize while acquiring new knowledge about traffic signs. To the best of the authors' knowledge, TSCLIP is the first contrastive language-image model applied to the worldwide cross-regional traffic sign recognition task. The project website is available at: https://github.com/guoyangzhao/TSCLIP.
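
As a rough illustration of the prompt engineering idea described in the abstract (scene descriptions plus rules that turn a class label into text), the sketch below builds several text prompts per traffic sign class. The template wording, the label-cleaning rule, and the names `SCENE_TEMPLATES` and `build_prompts` are assumptions made for this example; they are not the prompts or code used by TSCLIP.

```python
# Hypothetical prompt builder. The templates and the label-cleaning rule
# below are illustrative assumptions, not the prompts used in the paper.
SCENE_TEMPLATES = [
    "a photo of a {name} traffic sign.",
    "a {name} traffic sign on the road.",
    "a close-up photo of a {name} traffic sign.",
]


def build_prompts(class_names, templates=SCENE_TEMPLATES):
    """Return {class_name: [prompt, ...]} with one prompt per template."""
    prompts = {}
    for name in class_names:
        # Rule (assumed): turn dataset labels such as "No_Parking"
        # into natural-language phrases such as "no parking".
        label = name.replace("_", " ").lower()
        prompts[name] = [template.format(name=label) for template in templates]
    return prompts


prompts = build_prompts(["No_Overtaking", "No_Parking", "Stop"])
for class_name, texts in prompts.items():
    print(class_name, texts)
```

In a CLIP-style pipeline, the prompts for each class would typically be passed through the text encoder and compared with image embeddings; that step is omitted here because it depends on the specific CLIP implementation.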

Paper Structure

This paper contains 28 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Cross-regional traffic sign recognition and results. (a) outlines the task: TSCLIP is fine-tuned on specific traffic sign datasets and then used for recognition in other regions worldwide. (b) shows that our TSCLIP model far outperforms the classic model and exceeds the mainstream scheme.
  • Figure 2: Pattern differences of cross-regional samples. Four representative traffic signs are shown (No Overtaking, No Parking, No Pedestrians, and Stop).
  • Figure 3: Robust fine-tuning framework for TSCLIP model. (a) shows the contrastive language-image training process of TSCLIP with traffic sign prompts. (b) shows our proposed ADWE scheme for weight ensembling. (c) shows the Wise-FT scheme.
  • Figure 4: Evaluation of adaptive factors. We evaluate the fine-tuning effect of the adaptive factors under four settings of the scaling coefficient $\gamma$.
  • Figure 5: t-SNE visualization of different models. We select two classic models and four CLIP-based models and test them on the cross-regional dataset.
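
The weight-ensembling schemes named in Figure 3 (ADWE and Wise-FT) both combine zero-shot and fine-tuned weights. A minimal sketch of the underlying idea is given below, using plain Python lists in place of model tensors. `interpolate_weights` implements the standard Wise-FT-style linear interpolation in weight space. `example_factor` is a purely hypothetical schedule that shows how a scaling coefficient such as $\gamma$ could modulate a per-iteration mixing factor; the actual adaptive factor used by ADWE is not given in this section, so it should not be read as the paper's formula.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Linear weight-space ensembling: (1 - alpha) * zero_shot + alpha * fine_tuned.

    Both arguments map parameter names to lists of floats.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("parameter names must match")
    return {
        name: [
            (1.0 - alpha) * z + alpha * f
            for z, f in zip(zero_shot[name], fine_tuned[name])
        ]
        for name in zero_shot
    }


def example_factor(step, total_steps, gamma):
    """Hypothetical mixing schedule (an assumption, NOT the ADWE formula).

    Grows linearly with training progress, scaled by gamma, capped at 1.
    """
    return min(1.0, gamma * step / total_steps)


# Toy weights standing in for zero-shot CLIP and a fine-tuned checkpoint.
zs = {"w": [0.0, 2.0]}
ft = {"w": [1.0, 4.0]}

# Fixed-factor ensembling with equal weight on both models.
merged = interpolate_weights(zs, ft, 0.5)

# Per-iteration ensembling with the placeholder schedule.
for step in range(0, 11, 5):
    alpha = example_factor(step, total_steps=10, gamma=1.0)
    print(step, alpha, interpolate_weights(zs, ft, alpha))
```

With `alpha = 0` the result equals the zero-shot weights and with `alpha = 1` it equals the fine-tuned weights, so the factor directly controls how much task-specific knowledge is blended into the zero-shot model.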