Revolutionizing Traffic Sign Recognition: Unveiling the Potential of Vision Transformers
Susano Mingwin, Yulong Shisu, Yongshuai Wanwag, Sunshin Huing
TL;DR
This paper tackles robust traffic sign recognition by leveraging Vision Transformers. It introduces a pyramid EATFormer backbone that integrates Evolutionary Algorithm–based Transformer blocks (EAT) with modules for Global and Local Interaction (GLI), Multi-Scale Region Aggregation (MSRA), and Modulated Deformable MSA (MD-MSA), along with a Task-Related Head for flexible fusion. Across the GTSRB and BelgiumTS datasets, the approach achieves higher accuracy and faster inference than multiple CNN baselines and existing ViT variants, demonstrating practical potential for real-time TSR in ADAS and autonomous driving. The work highlights Vision Transformers as a promising direction for precise and dependable traffic sign classification in real-world conditions.
Abstract
This research introduces a method for Traffic Sign Recognition (TSR) based on deep learning, with particular emphasis on Vision Transformers. TSR plays a vital role in driver assistance systems and autonomous vehicles. Traditional TSR approaches rely on manual feature extraction, which is labor-intensive and costly. Methods based on shape and color have inherent limitations as well, including sensitivity to changes in lighting and other varying conditions. This study evaluates three Vision Transformer variants (PVT, TNT, LNL) and six convolutional neural networks (AlexNet, ResNet, VGG16, MobileNet, EfficientNet, GoogLeNet) as baseline models. To address the shortcomings of traditional methods, a novel pyramid EATFormer backbone is proposed that combines Evolutionary Algorithms (EAs) with the Transformer architecture. Its EA-based Transformer block captures multi-scale, interactive, and individual information through its Multi-Scale Region Aggregation, Global and Local Interaction, and Feed-Forward Network modules, respectively. In addition, a Modulated Deformable MSA (multi-head self-attention) module dynamically models irregular locations. Experimental evaluations on the GTSRB and BelgiumTS datasets show that the proposed approach improves both prediction speed and accuracy. The study concludes that Vision Transformers hold significant promise for traffic sign classification and contributes a new algorithmic framework for TSR. These findings lay the groundwork for precise and dependable TSR algorithms that benefit driver assistance systems and autonomous vehicles.
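The three-part block named in the abstract can be pictured as three residual steps applied to a sequence of token vectors. The following is a minimal, dependency-free sketch of that structure only, not the authors' implementation. It assumes the order MSRA, then GLI, then FFN, which the abstract does not state. Every function name (`eat_block`, `msra_stub`, `gli_stub`, `ffn_stub`, `neighbour_mean`) is hypothetical, and the stand-in modules use simple averaging in place of the learned convolutions and attention a real model would use.

```python
from typing import Callable, List

Tokens = List[List[float]]           # tokens[i][c]: token i, channel c
Module = Callable[[Tokens], Tokens]  # any shape-preserving sub-module


def residual(x: Tokens, f: Module) -> Tokens:
    """Return x + f(x) element-wise (a skip connection around f)."""
    y = f(x)
    return [[a + b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]


def neighbour_mean(x: Tokens, step: int) -> Tokens:
    """Mean of each token and the tokens `step` positions to either side (edges clamped)."""
    n = len(x)
    out = []
    for i in range(n):
        left, right = x[max(i - step, 0)], x[min(i + step, n - 1)]
        out.append([(l + c + r) / 3.0 for l, c, r in zip(left, x[i], right)])
    return out


def msra_stub(x: Tokens) -> Tokens:
    """Multi-scale stand-in (assumption): average neighbourhood means taken at steps 1 and 2."""
    near, far = neighbour_mean(x, 1), neighbour_mean(x, 2)
    return [[(a + b) / 2.0 for a, b in zip(nr, fr)] for nr, fr in zip(near, far)]


def gli_stub(x: Tokens) -> Tokens:
    """Interaction stand-in (assumption): the first half of the channels receives a
    global mean over all tokens, the remaining channels a local neighbourhood mean."""
    n, c = len(x), len(x[0])
    half = c // 2
    global_mean = [sum(tok[ch] for tok in x) / n for ch in range(c)]
    local = neighbour_mean(x, 1)
    return [global_mean[:half] + local[i][half:] for i in range(n)]


def ffn_stub(x: Tokens) -> Tokens:
    """Per-token stand-in (assumption): each token is transformed on its own (ReLU, then halve)."""
    return [[0.5 * max(v, 0.0) for v in tok] for tok in x]


def eat_block(x: Tokens, msra: Module = msra_stub, gli: Module = gli_stub,
              ffn: Module = ffn_stub) -> Tokens:
    """Apply the three residual parts in sequence (the order is an assumption)."""
    x = residual(x, msra)  # multi-scale information
    x = residual(x, gli)   # interactive (global + local) information
    return residual(x, ffn)  # individual, per-token information


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(eat_block(tokens))  # same shape as the input: 3 tokens, 2 channels
```

Because each part is wrapped in a skip connection, the block preserves the input shape, and any sub-module can be swapped for another shape-preserving function without touching `eat_block`.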
