Salient Object Detection in Traffic Scene through the TSOD10K Dataset

Yu Qiu, Yuhang Sun, Jie Mei, Lin Xiao, Jing Xu

TL;DR

The paper tackles safety-aware perception in driving by redefining saliency to combine semantic risk with visual conspicuity. It introduces TSOD10K, the first large-scale traffic salient object detection benchmark with pixel-level safety-semantic annotations, and presents Tramba, a Mamba-based network that integrates a Dual-Frequency Visual State Space (DFVSS) module and Helix-SS2D to capture both fine details and global directional context while incorporating driving attention priors. Experiments show Tramba achieving state-of-the-art performance on TSOD10K-TE and competitive NSI-SOD results on DUTS-TE, with ablations validating the contributions of the high-frequency and low-frequency VSS branches (HFVSS, LFVSS) and of Helix-SS2D. The work provides a foundation for safety-aware saliency analysis in intelligent transportation systems and suggests directions for multi-modal fusion and real-time deployment.

Abstract

Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic saliency (e.g., collision risks) and visual saliency. Unlike SOD in natural scene images (NSI-SOD), which prioritizes visually distinctive regions, TSOD emphasizes the objects that demand immediate driver attention due to their semantic impact, even when their visual contrast is low. This dual criterion, i.e., bridging perception and contextual risk, redefines saliency for autonomous and assisted driving systems. To address the lack of task-specific benchmarks, we collect the first large-scale TSOD dataset with pixel-wise saliency annotations, named TSOD10K. TSOD10K covers diverse object categories in real-world traffic scenes under challenging weather and illumination conditions (e.g., fog, snowstorms, low contrast, and low light). Methodologically, we propose a Mamba-based TSOD model, termed Tramba. Considering the challenge of distinguishing inconspicuous visual information from complex traffic backgrounds, Tramba introduces a novel Dual-Frequency Visual State Space module equipped with shifted window partitioning and dilated scanning to enhance the perception of fine details and global structure by hierarchically decomposing high- and low-frequency components. To emphasize critical regions in traffic scenes, we propose a traffic-oriented Helix 2D-Selective-Scan (Helix-SS2D) mechanism that injects driving attention priors while effectively capturing global multi-direction spatial dependencies. We establish a comprehensive benchmark by evaluating Tramba and 22 existing NSI-SOD models on TSOD10K, demonstrating Tramba's superiority. Our research establishes the first foundation for safety-aware saliency analysis in intelligent transportation systems.
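
To make the idea of decomposing features into high- and low-frequency components more concrete, the following is a minimal sketch of a DCT-based split of a single 2D feature map. It is not the authors' DFVSS module: the function names (dct_matrix, split_frequencies), the square low-frequency mask, and the cutoff parameter are assumptions introduced only for illustration, and the windowed and dilated scanning steps are not modeled here.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale
    return mat


def split_frequencies(feat: np.ndarray, cutoff: int):
    """Split a 2D map into low- and high-frequency parts.

    `cutoff` is a hypothetical parameter: DCT coefficients whose row and
    column indices are both below it count as low frequency.
    """
    h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feat @ dw.T  # forward 2D DCT
    mask = np.zeros((h, w), dtype=bool)
    mask[:cutoff, :cutoff] = True
    low = dh.T @ np.where(mask, coeffs, 0.0) @ dw   # inverse DCT of low band
    high = dh.T @ np.where(mask, 0.0, coeffs) @ dw  # inverse DCT of the rest
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.standard_normal((8, 8))
    low, high = split_frequencies(feature_map, cutoff=3)
    # The two parts are complementary, so they sum back to the input.
    print(np.allclose(low + high, feature_map))
```

Because the two masks are complementary and the transform is orthonormal, the low- and high-frequency parts add back to the original map. In a two-branch design of the kind the abstract describes, each part would then be processed by its own branch.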

Paper Structure

This paper contains 20 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Task discrepancy visualization. Rows 2 and 3 show that the salient objects in traffic scenes are sometimes not visually significant, indicating that the TSOD task is driven by both semantic and visual factors.
  • Figure 2: Some traffic images and their corresponding salient object labels picked from the TSOD10K dataset, covering diverse situations: non-motorized vehicles suddenly appearing, motion blur of pedestrians, inconspicuous vehicles suddenly about to overturn, poor visibility in extreme weather conditions/low-light, and glare caused by direct sunlight or car headlights.
  • Figure 3: Data statistics of our TSOD10K dataset. (a) Risk stratification: proportions of Normalcy, Emergency, and Crisis scenarios; (b) Object category prevalence including Vehicle, Human, Signage, and Obstacles; (c) Environmental condition distribution including Fine-Weather, Inclement, and Low-Light; (d) Target size analysis on Large vs. Small objects.
  • Figure 4: Data statistics of our TSOD10K dataset on attribute dependence and object location. (a) Multi-dependencies among attributes, with larger arc lengths indicating higher correlation probabilities. (b) Locations of the center points of traffic salient objects.
  • Figure 5: Detailed illustration of our Tramba. A: Tramba adopts a U-shaped encoder-decoder architecture, with the encoder built on VMamba-based Visual State Space (VSS). B: Our Dual-Frequency VSS (DFVSS) decouples encoded features into frequency domains via DCT, utilizing the high-frequency VSS (red component) with a sliding local window mechanism for fine-grained details, and a low-frequency VSS (blue components) with a dilated leapfrog scanning mechanism for global contextual dependencies. C1-C2: The basic VSS blocks incorporate horizontal and vertical scanning (D1-D2), while our Helix-VSS (HVSS) introduces a center-focused Helix scanning strategy (D3-D6) for driving-centric perception.
  • ...and 7 more figures
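
The center-focused Helix scanning mentioned in the Figure 5 caption can be pictured as visiting the cells of a feature map along a spiral instead of row by row. The sketch below only generates such a visiting order. The function names (spiral_inward, helix_scan_order) and the clockwise, center-outward convention are assumptions for illustration; the paper's actual scan paths (D3-D6) may differ in direction and starting point.

```python
def spiral_inward(h: int, w: int) -> list:
    """Visit an h x w grid clockwise from the top-left corner toward the center."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):  # top edge, left to right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):  # right edge, downward
            order.append((r, right))
        if top < bottom:  # bottom edge, right to left
            for c in range(right - 1, left - 1, -1):
                order.append((bottom, c))
        if left < right:  # left edge, upward
            for r in range(bottom - 1, top, -1):
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def helix_scan_order(h: int, w: int) -> list:
    """Center-outward spiral: the inward spiral traversed in reverse."""
    return spiral_inward(h, w)[::-1]


if __name__ == "__main__":
    for row, col in helix_scan_order(3, 3):
        print(row, col)
```

Flattening a feature map in this order places tokens near the image center at the start of the sequence and border tokens at the end. This is one plausible way to bias a sequential scan toward the region a driver typically attends to.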