TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, Andreas Geiger

TL;DR

The paper tackles end-to-end autonomous driving with multi-modal sensor fusion and highlights the limitations of geometry-based fusion in dense urban scenes. It proposes TransFuser, a transformer-based architecture that fuses image and LiDAR BEV features at multiple resolutions to capture global context. At the time of submission, it reports state-of-the-art driving scores on the CARLA leaderboard and fewer infractions on Longest6, a demanding new benchmark introduced in the paper. These results are supported by ablations, attention analyses, and an image-only Latent TransFuser baseline. The paper also discusses practical limitations and avenues for future extensions.

Abstract

How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.
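
As a rough illustration of what fusing image and LiDAR representations with self-attention can look like, the sketch below concatenates tokens from both modalities, lets every token attend to every other token, and splits the result back into two streams. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`self_attention`, `fuse_tokens`), the identity query/key/value projections, the single attention head, and the residual addition are all simplifications chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    attended = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return attended, weights


def fuse_tokens(image_tokens, lidar_tokens):
    """Jointly attend over both modalities, then split back per modality.

    Assumption: the attended features are added back to the inputs
    (a residual connection), which is a choice made for this sketch.
    """
    tokens = image_tokens + lidar_tokens  # list concatenation
    attended, weights = self_attention(tokens)
    fused = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    n_image = len(image_tokens)
    return fused[:n_image], fused[n_image:], weights


# Toy example: two image tokens and one LiDAR BEV token, each 2-dimensional.
image_out, lidar_out, attn = fuse_tokens([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

Because the attention matrix spans all tokens, each image token receives a nonzero weight on each LiDAR token and vice versa, which is the sense in which attention provides global context across modalities.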

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Illustration. Consider an intersection with oncoming traffic from the left. To safely navigate the intersection, the agent (green) must capture the global context of the scene involving the interaction between the traffic light (yellow) and the crossing traffic (red). Our TransFuser model integrates geometric and semantic information across multiple modalities via attention mechanisms to capture global context, leading to safe driving behavior in CARLA.
  • Figure 2: Architecture. We consider RGB image and LiDAR BEV representations (Section \ref{sec:io_parameterization}) as inputs to our multi-modal fusion transformer (TransFuser), which uses several transformer modules to fuse intermediate feature maps between both modalities. This fusion is applied at multiple resolutions throughout the feature extractor, resulting in a 512-dimensional feature vector from each of the image and LiDAR BEV streams; the two vectors are combined via element-wise summation. The resulting 512-dimensional feature vector constitutes a compact representation of the environment that encodes the global context of the 3D scene. It is processed by an MLP and then passed to an auto-regressive waypoint prediction network. We use a single-layer GRU followed by a linear layer that takes in the hidden state and predicts the differential ego-vehicle waypoints $\{ \delta \mathbf{w}_{t}\}_{t=1}^T$, represented in the ego-vehicle's current coordinate frame.
  • Figure 3: Auxiliary Loss Functions. Besides the waypoint loss (Eq. \ref{eqn:loss}), we incorporate four auxiliary tasks: depth prediction and semantic segmentation from the image branch; HD map prediction and vehicle detection from the BEV branch.
  • Figure 4: Expert performing an unprotected left turn. The black boxes on the street mark the path that the expert has to follow. The predictions of the bicycle model are colored green for the expert and blue for all other vehicles. Red bounding boxes mark predicted collisions. The white box around the car is used to detect the traffic light trigger boxes that are placed on the street (e.g. bottom left of Fig. \ref{fig:expert1}).
  • Figure 5: Lane Change Failures. TransFuser fails at lane changes in dense traffic, incurring a high number of consecutive collisions on routes where these situations occur. Two examples are shown in the top and bottom rows. Time advances from left to right.
  • ...and 2 more figures
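
The Figure 2 caption mentions two small operations that are easy to make concrete: the element-wise summation of the two stream outputs and the use of differential waypoints $\delta \mathbf{w}_t$. The sketch below is illustrative only. The function names are hypothetical, and the accumulation rule $\mathbf{w}_t = \mathbf{w}_{t-1} + \delta \mathbf{w}_t$ starting from the origin of the ego-vehicle frame is an assumption about how differential waypoints are turned into positions; the GRU and MLP themselves are not modeled here.

```python
def sum_stream_features(image_feature, lidar_feature):
    """Combine the two per-stream feature vectors by element-wise summation."""
    if len(image_feature) != len(lidar_feature):
        raise ValueError("both streams must produce vectors of the same length")
    return [a + b for a, b in zip(image_feature, lidar_feature)]


def accumulate_waypoints(deltas, start=(0.0, 0.0)):
    """Turn differential waypoints into positions in the ego-vehicle frame.

    Assumption: w_t = w_{t-1} + delta_w_t, with w_0 at the ego-vehicle
    origin (0, 0). `deltas` is a sequence of (dx, dy) pairs.
    """
    x, y = start
    waypoints = []
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


# Toy example with made-up numbers (not taken from the paper).
fused = sum_stream_features([0.5] * 512, [0.25] * 512)
path = accumulate_waypoints([(1.0, 0.0), (1.0, 0.5), (0.5, 0.5)])
```

Predicting offsets rather than absolute positions means each step only has to describe motion relative to the previous waypoint; the absolute path is recovered by the running sum.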