EraW-Net: Enhance-Refine-Align W-Net for Scene-Associated Driver Attention Estimation

Jun Zhou, Chunsheng Liu, Faliang Chang, Wenqian Wang, Penghui Hao, Yiming Huang, Zhiqiang Yang

TL;DR

EraW-Net is a novel end-to-end framework for scene-associated driver attention estimation that aggregates information from dual views: it enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net.

Abstract

Associating driver attention with the driving scene across two fields of view (FOVs) is a hard cross-domain perception problem that requires jointly considering cross-view mapping, dynamic driving scene analysis, and driver status tracking. Previous methods typically focus on a single view or map attention to the scene via estimated gaze, failing to exploit the implicit connection between the two views. Moreover, simple fusion modules are insufficient for modeling the complex relationships between the two views, making information integration challenging. To address these issues, we propose a novel method for end-to-end scene-associated driver attention estimation, called EraW-Net. This method enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net. Specifically, a Dynamic Adaptive Filter Module (DAF-Module) is proposed to address the challenges of frequently changing driving environments by extracting vital regions. It suppresses indiscriminately recorded dynamics and highlights crucial ones through joint frequency-spatial analysis, enhancing the model's ability to parse complex dynamics. Additionally, to track driver states under non-fixed facial poses, we propose a Global Context Sharing Module (GCS-Module) that constructs refined feature representations by capturing hierarchical features that adapt to various scales of head and eye movements. Finally, W-Net achieves systematic cross-view information integration through its "Encoding-Independent Partial Decoding-Fusion Decoding" structure, addressing semantic misalignment in heterogeneous data integration. Experiments on large public datasets demonstrate that the proposed method robustly and accurately estimates the mapping of driver attention onto the scene.
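The "Encoding-Independent Partial Decoding-Fusion Decoding" flow described above can be sketched as a toy data-flow example. This is only an illustration under loud assumptions, not the authors' network: features are plain 1D lists, pooling and averaging stand in for learned layers, the element-wise product is a placeholder fusion, and every function name (`encode`, `partial_decode`, `fusion_decode`, `w_net`) is hypothetical.

```python
from typing import List

Feature = List[float]  # toy stand-in for a feature map


def encode(x: Feature, levels: int = 3) -> List[Feature]:
    """Stage I: build a feature pyramid by average-pooling adjacent pairs."""
    pyramid = [x]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pooled = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        pyramid.append(pooled)
    return pyramid


def upsample(x: Feature) -> Feature:
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]


def partial_decode(pyramid: List[Feature], stop_level: int = 1) -> Feature:
    """Stage II: decode inside one branch only, stopping before the finest level."""
    out = pyramid[-1]
    for level in range(len(pyramid) - 2, stop_level - 1, -1):
        up = upsample(out)
        skip = pyramid[level]
        out = [(u + s) / 2 for u, s in zip(up, skip)]  # merge with skip feature
    return out


def fusion_decode(driver: Feature, scene: Feature) -> Feature:
    """Stage III: fuse the two partially decoded features, then finish decoding."""
    fused = [d * s for d, s in zip(driver, scene)]  # placeholder fusion (assumed)
    return upsample(fused)


def w_net(driver_frame: Feature, scene_frame: Feature) -> Feature:
    # Each view is encoded and partially decoded on its own branch first,
    # so both features reach the same resolution before they are fused.
    driver_mid = partial_decode(encode(driver_frame))
    scene_mid = partial_decode(encode(scene_frame))
    return fusion_decode(driver_mid, scene_mid)


attention = w_net([1.0] * 8, [2.0] * 8)
print(len(attention))
```

The point of the sketch is the ordering: each branch is brought to a common intermediate level before any cross-view operation happens, which is the alignment-before-fusion idea the abstract attributes to W-Net.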

Paper Structure (28 sections, 36 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Comparison of different tasks related to driver attention. The inputs to the model are highlighted with red boxes in the figure. (a) 3D gaze estimation: Determine the direction in which the driver is looking. (b) Driver attention prediction: Predict where the driver should look. (c) Gaze zone estimation: Estimate which zone the driver is looking at. (d) and (e) both estimate where the driver is looking in the current scene. (d) Attention map projected from estimated 3D gaze: The results of 3D gaze estimation are mapped onto the scene image using camera parameters and depth information, which is a two-step estimation method. (e) Our proposed end-to-end mapping of driver attention to the driving scene: Estimate driver attention in the current environment end to end by leveraging complementary information from both the driver-facing and scene-facing views.
  • Figure 2: Comparison of three fusion strategies and our W-Net for cross-domain information integration. (a) Decision-Level Fusion [wang2023crack]: Aggregate the independent decision outputs from the two sources into a final decision. (b) Late-Feature-Level Fusion [liu2023cross]: Process each input independently, then integrate the late features through post-processing. (c) Hierarchical-Feature-Level Fusion [wen2023msgfusion]: Fuse features from both inputs layer by layer. (d) W-Net: Use an "Encoding (Stage I)-Independent Partial Decoding (Stage II)-Fusion Decoding (Stage III)" architecture to integrate information, supporting two inputs from two different domains. Details are provided in Section III.D.
  • Figure 3: Architecture of EraW-Net. The overall architecture is based on the proposed W-Net structure, which includes three key stages: (a) encoding, (b) independent partial decoding, and (c) fusion decoding. During feature encoding, Channel Reduction Units (CRUs) are employed to standardize channel dimensions across corresponding layers of both branches. The Dynamic Adaptive Filter Module (DAF-Module) employs joint frequency-spatial filtering masks derived from inter-frame dynamics to emphasize significant dynamics within original features. The Global Context Sharing Module (GCS-Module) extracts and consolidates multi-scale features to refine a comprehensive global representation of facial features. The core of W-Net's Two-Stage Decoding (TS-Decoding) strategy (depicted in (b) and (c)) lies in its approach of intra-domain feature alignment before fusion decoding, ensuring semantic consistency across domains. This methodology ensures that subsequent fusion processes operate on well-aligned and high-quality feature representations.
  • Figure 4: The Dynamic Adaptive Filter Module (DAF-Module) processes inter-frame dynamic information through joint frequency-spatial analysis, guiding the model to focus on critical motion characteristics. It first calculates dynamic features through local correlation, then filters out redundant dynamics in the frequency domain, and finally emphasizes spatial areas with significant short-term changes to strengthen the feature representation.
  • Figure 5: The proposed GCS-Module comprises two processes: Intra-Level Multi-scale Feature Aggregation (ILA) and Cross-Level Feature Semantic Alignment (CLA). The ILA aggregates multi-scale information embedded within features at each layer using a Hierarchical Feature Fusion (HFF) structure connected by channel-mixing attention, as shown in (b). The CLA unit then aligns information across layers to establish a globally refined feature representation.
  • ...and 4 more figures
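
The three steps named in the Figure 4 caption (inter-frame dynamics, frequency-domain filtering, spatial enhancement) can be sketched roughly as follows. This is a toy illustration and not the authors' DAF-Module: an absolute frame difference stands in for the local correlation, a fixed low-pass mask stands in for whatever filter the paper actually uses, and `dft2`, `daf_sketch`, and `cutoff` are hypothetical names chosen for this example.

```python
import cmath
from typing import List

Grid = List[List[float]]


def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform (only sensible for tiny grids)."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for r in range(h):
                for c in range(w):
                    angle = sign * 2 * cmath.pi * (u * r / h + v * c / w)
                    total += x[r][c] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out


def daf_sketch(prev: Grid, curr: Grid, feat: Grid, cutoff: int = 1) -> Grid:
    h, w = len(curr), len(curr[0])
    # Step 1 (stand-in): inter-frame dynamics as an absolute frame difference.
    dyn = [[abs(curr[r][c] - prev[r][c]) for c in range(w)] for r in range(h)]
    # Step 2 (assumed filter): zero every frequency above `cutoff`.
    spec = dft2(dyn)
    for u in range(h):
        for v in range(w):
            fu, fv = min(u, h - u), min(v, w - v)  # wrap-around frequency index
            if fu > cutoff or fv > cutoff:
                spec[u][v] = 0j
    smooth = [[abs(z) for z in row] for row in dft2(spec, inverse=True)]
    # Step 3: use the normalised filtered dynamics as a spatial gain on the features.
    peak = max(max(row) for row in smooth) or 1.0
    return [[feat[r][c] * (1.0 + smooth[r][c] / peak) for c in range(w)]
            for r in range(h)]


prev = [[0.0] * 4 for _ in range(4)]
curr = [[0.0] * 4 for _ in range(4)]
curr[1][1] = 1.0  # a single location changes between the two frames
feat = [[1.0] * 4 for _ in range(4)]
enhanced = daf_sketch(prev, curr, feat)
print(enhanced[1][1])
```

In this sketch static regions keep a gain close to 1, while the location that changed between frames receives the largest gain; a learned module would replace both the difference operator and the fixed mask.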