M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving

Dongyang Xu, Haokun Li, Qingfan Wang, Ziying Song, Lei Chen, Hanming Deng

TL;DR

M2DA introduces a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module to better fuse multi-modal data and achieve higher alignment between modalities. It also incorporates driver attention to give autonomous vehicles human-like scene understanding, so that they can precisely identify crucial areas in complex scenarios and ensure safety.

Abstract

End-to-end autonomous driving has witnessed remarkable progress. However, the extensive deployment of autonomous vehicles has yet to be realized, primarily due to two challenges: 1) inefficient multi-modal environment perception, i.e., how to integrate data from multi-modal sensors more efficiently; and 2) non-human-like scene understanding, i.e., how to effectively locate and predict critical risky agents in traffic scenarios like an experienced driver. To overcome these challenges, we propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving. To better fuse multi-modal data and achieve higher alignment between different modalities, we propose a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module. By incorporating driver attention, we give autonomous vehicles a human-like scene understanding ability, allowing them to precisely identify crucial areas within complex scenarios and ensure safety. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance with less data in closed-loop benchmarks. Source code is available at https://anonymous.4open.science/r/M2DA-4772.

Paper Structure (36 sections, 16 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: We present M2DA, a multi-modal fusion transformer incorporating driver attention, for end-to-end autonomous driving. M2DA takes multi-view images and Lidar point clouds as inputs. First, a DA prediction model mimics the focal points of drivers' visual gaze; its output is treated as a mask that adjusts the weights of the raw images to enhance the image data. Next, ResNet-based backbones extract image features and Lidar BEV representations. These extracted representations are encoded using global average pooling with positional encoding and are then treated as queries to calculate cross-attention with point clouds and images, respectively. The outputs are considered the final fused features and are fed into the subsequent transformer encoder. Three types of queries, i.e., waypoint query, perception and prediction query, and traffic query, are fed into the transformer decoder to obtain corresponding features for downstream tasks. Lastly, M2DA adopts an auto-regressive waypoint prediction network to predict future waypoints and uses MLPs to predict the perception map for surrounding objects and traffic states.
  • Figure 2: Each row represents a representative traffic scenario encountered by M2DA. The three columns on the left display the left-view, front-view, and right-view images, respectively. The fourth column shows the prediction results for driver attention. The last column represents the perceived states of surrounding vehicles. The yellow box denotes the ego vehicle. White, light gray, and gray boxes represent the perceived surrounding vehicles' current positions, predicted positions at the next time interval, and predicted positions at the next two time intervals, respectively. Green dots and red dots represent safe future trajectories of the ego and unsafe areas where collisions are likely to occur, respectively.
  • Figure 3: Detailed visualization of the pedestrian crossing case.
  • Figure 4: Visualization of a failure case with three RGB images, the predicted driver attention, and a heatmap image representing perceptual information. Yellow and white boxes denote the ego vehicle and perceived surrounding objects, respectively. Green dots and red dots represent safe future trajectories and unsafe areas where collisions are likely to occur, respectively.
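
Two steps in the Figure 1 caption, using driver attention as a mask on the raw image and using a pooled representation as a query for cross-attention, can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, the `floor` parameter, the choice of keys and values, and the omission of positional encoding, learned projections, and multiple heads are all assumptions made for illustration.

```python
import math


def apply_attention_mask(image, mask, floor=0.5):
    """Reweight pixels by a driver-attention map with values in [0, 1].

    `floor` is a hypothetical parameter (not from the paper) that keeps
    non-salient pixels from being zeroed out entirely.
    """
    return [
        [pixel * (floor + (1.0 - floor) * weight)
         for pixel, weight in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def global_average_pool(tokens):
    """Average a list of feature vectors into a single query vector."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Assumption: the pooled query attends over tokens of the other
    modality; learned projections and positional encoding are omitted.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


# Toy usage: pool hypothetical image tokens into a query, then attend
# over hypothetical Lidar BEV tokens to obtain a fused feature vector.
image_tokens = [[1.0, 0.0], [3.0, 0.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0]]
query = global_average_pool(image_tokens)
fused, weights = cross_attention(query, lidar_tokens, lidar_tokens)
```

In this toy example the pooled query aligns with the first Lidar token, so that token receives the larger attention weight. A practical model would use a deep-learning framework with batched tensors and learned projection matrices.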