Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition
Aditya Mishra, Akshay Agarwal, Haroon Lone
TL;DR
This work tackles nighttime traffic sign recognition (TSR) by introducing INTSD, a real-world Indian nighttime TSR dataset with 41 sign classes and 14,044 instances across six districts, plus daytime images for context. It then proposes LENS-Net, a two-stage framework that combines an adaptive illumination-aware detector with a multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based attribute reasoning. Empirical results show LENS-Net achieves state-of-the-art performance on INTSD with $mAP@50=92.56\%$ for detection and $Acc=88.59\%$ for classification, with ablations confirming the contribution of adaptive image enhancement, CLIP-driven text priors, and shape-color graph reasoning. The work also provides cross-domain evaluation on RTSD to assess generalization, highlighting both the potential and the limitations of current multimodal approaches for robust TSR under adverse illumination, and sets a new benchmark for nighttime traffic sign recognition.
Abstract
Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet recognizing traffic signs at night remains challenging due to visual noise and the scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, we first introduce INTSD, a large-scale dataset comprising street-level nighttime images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for nighttime sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Second, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.
