Salient Object Detection in Traffic Scene through the TSOD10K Dataset
Yu Qiu, Yuhang Sun, Jie Mei, Lin Xiao, Jing Xu
TL;DR
The paper tackles safety-aware perception in driving by redefining saliency to combine semantic risk with visual conspicuity. It introduces TSOD10K, the first large-scale traffic salient object detection benchmark with pixel-level safety-semantic annotations, and presents Tramba, a Mamba-based network that combines a Dual-Frequency Visual State Space (DFVSS) module with a Helix 2D-Selective-Scan (Helix-SS2D) mechanism to capture both fine details and global directional context while incorporating driving attention priors. Experiments show that Tramba achieves state-of-the-art performance on TSOD10K-TE and competitive NSI-SOD results on DUTS-TE, and ablations validate the contributions of the high- and low-frequency branches (HFVSS, LFVSS) and of Helix-SS2D. The work provides a foundation for safety-aware saliency analysis in intelligent transportation systems and suggests directions for multi-modal fusion and real-time deployment.
Abstract
Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic saliency (e.g., collision risk) with visual saliency. Unlike SOD in natural scene images (NSI-SOD), which prioritizes visually distinctive regions, TSOD emphasizes objects that demand immediate driver attention because of their semantic impact, even when their visual contrast is low. This dual criterion, which bridges perception and contextual risk, redefines saliency for autonomous and assisted driving systems. To address the lack of task-specific benchmarks, we collect the first large-scale TSOD dataset with pixel-wise saliency annotations, named TSOD10K. TSOD10K covers diverse object categories in real-world traffic scenes under challenging weather and illumination conditions (e.g., fog, snowstorms, low contrast, and low light). Methodologically, we propose a Mamba-based TSOD model, termed Tramba. To address the challenge of distinguishing inconspicuous visual information from complex traffic backgrounds, Tramba introduces a novel Dual-Frequency Visual State Space module, equipped with shifted window partitioning and dilated scanning, that enhances the perception of fine details and global structure by hierarchically decomposing features into high- and low-frequency components. To emphasize critical regions in traffic scenes, we propose a traffic-oriented Helix 2D-Selective-Scan (Helix-SS2D) mechanism that injects driving attention priors while effectively capturing global multi-direction spatial dependencies. We establish a comprehensive benchmark by evaluating Tramba and 22 existing NSI-SOD models on TSOD10K, demonstrating Tramba's superiority. Our research lays the first foundation for safety-aware saliency analysis in intelligent transportation systems.
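The two ideas named in the abstract, splitting features into high- and low-frequency components and scanning a 2D grid along a helix-like path, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify either operation, so the box-filter low-pass, the residual high-pass, the clockwise outside-in spiral, and the names `split_frequencies` and `spiral_order` are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def split_frequencies(image: Grid, radius: int = 1) -> Tuple[Grid, Grid]:
    """Split a 2D grid into a low-frequency part and a high-frequency residual.

    Assumption: a box blur stands in for the low-pass filter; the paper's
    DFVSS module may decompose frequencies differently.
    """
    height, width = len(image), len(image[0])
    low = [[0.0] * width for _ in range(height)]
    high = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Average over a square window, clipped at the image border.
            rows = range(max(0, r - radius), min(height, r + radius + 1))
            cols = range(max(0, c - radius), min(width, c + radius + 1))
            total = sum(image[i][j] for i in rows for j in cols)
            low[r][c] = total / (len(rows) * len(cols))
            # The residual keeps edges and fine detail.
            high[r][c] = image[r][c] - low[r][c]
    return low, high


def spiral_order(height: int, width: int) -> List[Tuple[int, int]]:
    """Return every (row, col) of a grid in clockwise, outside-in spiral order.

    Assumption: a plain spiral is one plausible reading of a "helix" scan;
    the actual Helix-SS2D paths and attention priors are not given here.
    """
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order: List[Tuple[int, int]] = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]        # top edge
        order += [(r, right) for r in range(top + 1, bottom + 1)]  # right edge
        if top < bottom:                                           # bottom edge
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:                                           # left edge
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def scan_sequence(grid: Grid) -> List[float]:
    """Flatten a grid into the 1D sequence a sequence model would consume."""
    return [grid[r][c] for r, c in spiral_order(len(grid), len(grid[0]))]
```

In this sketch, a state space model would consume `scan_sequence(low)` and `scan_sequence(high)` as separate 1D token sequences. Reversing the sequence yields an inside-out scan that ends at the border instead of the center.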
