DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

Abstract

Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers; lapses in attention can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates the task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt a Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid (FFP) for cross-layer interaction with dense, multi-scale conditional diffusion, jointly strengthening denoising learning and modeling fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable, driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and driver-state measurement in intelligent vehicles.
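The paper's code is not shown here, so as a rough illustration of the conditional diffusion-denoising formulation described in the abstract, the following PyTorch-style sketch shows one standard DDPM-style training step: Gaussian noise added to the ground-truth attention map is predicted by a network conditioned on encoded scene features. All names (`encoder`, `noise_predictor`), the timestep count, and the noise schedule are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of one conditional diffusion training step for
# attention-map prediction (generic DDPM objective; not the authors' code).
import torch
import torch.nn.functional as F

T = 1000                                    # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(encoder, noise_predictor, frame, attn_map):
    """frame: (B, 3, H, W) scene image; attn_map: (B, 1, H, W) ground truth."""
    b = frame.size(0)
    t = torch.randint(0, T, (b,), device=frame.device)     # random timesteps
    a_bar = alpha_bars.to(frame.device)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(attn_map)
    # Forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * attn_map + (1.0 - a_bar).sqrt() * noise
    cond = encoder(frame)                   # scene features as conditioning
    pred = noise_predictor(x_t, t, cond)    # predict the injected noise
    return F.mse_loss(pred, noise)          # standard epsilon-MSE objective
```

At inference, the learned predictor would be applied iteratively from pure noise, conditioned on the scene features, to denoise its way to an attention map.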

Paper Structure

This paper contains 15 sections, 16 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 2: Overview of DiffAttn, the proposed LLM-enhanced, conditional-diffusion-based method for modeling drivers' visual attention.
  • Figure 3: DiffAttn architecture overview. For the saliency encoder, we adopt SwinT-Base pretrained on ImageNet. The decoder is designed with an LLM-enhanced feature fusion pyramid (FFP), which bridges the encoder outputs, and a multi-scale densely connected conditional diffusion module, where the feature maps produced by the FFP are densely connected and serve as conditioning signals for noise learning in the diffusion process. The noise predictors generate saliency maps at multiple scales, all supervised with ground-truth saliency maps; during testing, the saliency map generated at $s=0$ serves as the final output. (A minimal code sketch of this multi-scale conditioning follows the figure list.)
  • Figure 4: Network architecture of LLM-based semantic enhancement.
  • Figure 5: Network architecture of multi-scale conditional diffusion.
  • Figure 6: Qualitative results on TrafficGaze: (a) Surrounding vehicle driving in right lane; (b) Changing to right lane with a truck ahead; (c) Straight driving with a traffic sign ahead. Qualitative results on DADA-2000: (d) Motorcycle crossing; (e) Two trucks ahead; (f) Nearby truck changing lane ahead; (g) Pedestrian crossing; (h) Turning right with collision risk involving a taxi; (i) Pedestrian running in front of ego-vehicle. Qualitative results on BDD-A: (j) Vehicle crossing ahead; (k) Straight driving with a traffic light ahead; (l) Driving past parked cars; (m) Lane change with a braking vehicle ahead; (n) Approaching a STOP line. Qualitative results on DrFixD-rainy: (o) Entering a main road with congested traffic; (p) Nearby left vehicle changing into the ego lane; (q) Driving through a green light; (r) Driving on a rural road with a bicyclist on the right; (s) Pedestrians standing at the roadside.
  • ...and 3 more figures
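As referenced in the Figure 3 caption above, the FFP feature maps are densely connected and condition the noise learning at several scales, with every scale supervised against the ground truth. The sketch below is one plausible reading of that design; the scale ordering, the head interface, and the use of bilinear resizing for the dense connections are all assumptions, not the authors' code.

```python
# Hypothetical multi-scale, densely connected conditioning: each pyramid
# scale gets its own prediction head, conditioned on its FFP features plus
# upsampled conditions from coarser scales, and every scale is supervised.
import torch
import torch.nn.functional as F

def multi_scale_loss(ffp_feats, heads, x_t, t, gt_map):
    """ffp_feats: pyramid feature maps, coarsest first (assumed ordering);
    heads: one noise-predictor/saliency head per scale (assumed interface);
    x_t: noisy saliency map at full resolution; gt_map: ground truth."""
    loss, dense = 0.0, []
    for feats, head in zip(ffp_feats, heads):
        h, w = feats.shape[-2:]
        # Dense connection: reuse every coarser condition, resized here.
        prior = [F.interpolate(p, size=(h, w), mode="bilinear",
                               align_corners=False) for p in dense]
        cond = torch.cat([feats] + prior, dim=1)
        x_s = F.interpolate(x_t, size=(h, w), mode="bilinear",
                            align_corners=False)
        pred = head(x_s, t, cond)                    # per-scale saliency map
        gt_s = F.interpolate(gt_map, size=(h, w), mode="bilinear",
                             align_corners=False)
        loss = loss + F.mse_loss(pred, gt_s)         # supervise every scale
        dense.append(cond)
    return loss
```

Supervising each scale (rather than only the finest) gives the coarser conditioning paths a direct training signal, which is consistent with the caption's statement that the multi-scale outputs are all compared against ground-truth saliency maps.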