EMDFNet: Efficient Multi-scale and Diverse Feature Network for Traffic Sign Detection

Pengyu Li; Chenhe Liu; Tengfei Li; Xinyu Wang; Shihui Zhang; Dongyang Yu

TL;DR

EMDFNet tackles feature singularity and weak multi-scale fusion in traffic sign detection (TSD) by introducing an Augmented Shortcut Module (ASM) and an Efficient Hybrid Encoder (EHE) built atop a Res2Net backbone and trained with SIoU loss. The approach diversifies feature representations and enhances cross-scale fusion, enabling robust small-object detection while preserving real-time, single-stage inference. Extensive experiments on TT100K and GTSDB demonstrate state-of-the-art $mAP$ and strong $AP_s$, with competitive $FPS$ and reduced parameters. The results underscore the effectiveness of diversified feature pathways and cross-scale integration for reliable TSD in complex driving scenes, and point toward further lightweight optimizations for practical deployment.

Abstract

The detection of small objects, particularly traffic signs, is a critical subtask within object detection and autonomous driving. Despite notable progress in previous research, two primary challenges persist. First, the extracted features lack diversity. Second, detectors fail to integrate features effectively across objects of varying sizes and scales. Both issues are also prevalent in generic object detection. Motivated by these challenges, we propose a novel object detection network for traffic sign detection, the Efficient Multi-scale and Diverse Feature Network (EMDFNet), which integrates an Augmented Shortcut Module and an Efficient Hybrid Encoder to address both issues simultaneously. Specifically, the Augmented Shortcut Module uses multiple branches to integrate spatial and channel semantic information, thereby enhancing feature diversity. The Efficient Hybrid Encoder combines global feature fusion with local feature interaction across these diverse features, adaptively integrating feature information to generate distinctive classification features. Extensive experiments on the Tsinghua-Tencent 100K (TT100K) benchmark and the German Traffic Sign Detection Benchmark (GTSDB) demonstrate that EMDFNet outperforms other state-of-the-art detectors while retaining the real-time processing capabilities of single-stage models. This substantiates the effectiveness of EMDFNet in detecting small traffic signs.

Paper Structure (20 sections, 7 equations, 5 figures, 5 tables)

Introduction
Related Work
Object Detection
Traffic Sign Detection
Proposed Network
Network Overview
Augmented Shortcut Module
Efficient Hybrid Encoder
Loss Function
Experiment
Dataset
Implementation Details
Evaluation Metrics
Quantitative Analysis
Comparisons of SOTA on TT100K Dataset.
...and 5 more sections

Figures (5)

Figure 1: The difficulties in traffic sign detection. In real traffic scenes, traffic sign detection faces many difficulties including illumination, occlusion, small size, viewpoint and so on.
Figure 2: The EMDFNet can be roughly divided into four parts. The first part consists of a backbone composed of Res2Net. The second part is the Augmented Shortcut Module, which enhances feature diversity. The third part is the Efficient Hybrid Encoder for integrating multi-scale features. Finally, the fourth part comprises three prediction heads for bounding box prediction.
Figure 3: The dilation rates used in (a) are [1, 2, 5], where all pixel values are effectively utilized, while in (b), the dilation rates are [2, 2, 2], resulting in a gridding effect.
Figure 4: Illustrations of 45 remaining traffic sign categories from the TT100K dataset.
Figure 5: Comparison of detection performance between EMDFNet and other models on the TT100K testing set. Zoom in to see details.
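
The gridding effect described in the Figure 3 caption can be checked with a small calculation. The sketch below is a 1-D simplification and not the authors' code: it assumes three stacked 3-tap kernels and enumerates which input offsets can reach a single output position for a given list of dilation rates. The function names `covered_offsets` and `coverage_ratio` are hypothetical and introduced only for illustration.

```python
from itertools import product


def covered_offsets(dilation_rates):
    """Return the set of 1-D input offsets reached by stacked 3-tap kernels.

    Each layer with dilation rate r samples the offsets (-r, 0, +r), so the
    offsets reachable through the whole stack are all sums of one tap per layer.
    """
    taps_per_layer = [(-r, 0, r) for r in dilation_rates]
    return {sum(combo) for combo in product(*taps_per_layer)}


def coverage_ratio(dilation_rates):
    """Fraction of positions inside the receptive field that are actually sampled."""
    reach = sum(dilation_rates)   # farthest offset on either side
    span = 2 * reach + 1          # width of the receptive field
    return len(covered_offsets(dilation_rates)) / span


for rates in ([1, 2, 5], [2, 2, 2]):
    reach = sum(rates)
    missing = sorted(set(range(-reach, reach + 1)) - covered_offsets(rates))
    print(rates, "coverage:", round(coverage_ratio(rates), 3), "missing:", missing)
```

Under these assumptions, the rates [1, 2, 5] reach every one of the 17 offsets in [-8, 8], whereas [2, 2, 2] reaches only the 7 even offsets out of 13 in [-6, 6]. The skipped odd offsets are the 1-D analogue of the gridding pattern.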
