Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer

Hao Shao, Letian Wang, RuoBing Chen, Hongsheng Li, Yu Liu

TL;DR

The paper tackles safety in autonomous driving, focusing on long-tail rare events and interpretability of decisions. It proposes InterFuser, a one-stage interpretable sensor fusion Transformer that fuses multi-modal multi-view inputs and outputs intermediate representations such as the ego trajectory with $L=10$ waypoints and an object density map $M \in \mathbb{R}^{R\times R\times 7}$ plus traffic-rule signals. A safety controller uses these intermediate features to constrain actions within safe sets by computing maximum safe distances $s_1$, $s_2$ and solving a linear program for the desired speed, while forecasting other agents' motion with a tracker. Experiments on CARLA benchmarks show InterFuser achieves state-of-the-art driving performance and ranks first on the public leaderboard.

Abstract

Large-scale deployment of autonomous vehicles has been continually delayed due to safety concerns. On the one hand, comprehensive scene understanding is indispensable; without it, a vehicle is vulnerable to rare but complex traffic situations, such as the sudden emergence of unknown objects. However, reasoning from a global context requires access to sensors of multiple types and adequate fusion of multi-modal sensor signals, which is difficult to achieve. On the other hand, the lack of interpretability in learning models also hampers safety, because failure causes cannot be verified. In this paper, we propose a safety-enhanced autonomous driving framework, named Interpretable Sensor Fusion Transformer (InterFuser), to fully process and fuse information from multi-modal multi-view sensors for achieving comprehensive scene understanding and adversarial event detection. Besides, intermediate interpretable features are generated from our framework, which provide more semantics and are exploited to better constrain actions to be within the safe sets. We conducted extensive experiments on CARLA benchmarks, where our model outperforms prior methods, ranking first on the public CARLA Leaderboard. Our code will be made available at https://github.com/opendilab/InterFuser

Paper Structure (23 sections, 13 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Safe and efficient driving requires comprehensive scene understanding by fusing information from multiple sensors. Peeking into the intermediate interpretable features of learning models can also unveil the model's decision basis. Such features enable improvable systems with access to failure causes, and can be used as a safety heuristic to constrain actions within the safe set.
  • Figure 2: Overview of our approach. We first use CNN backbones to extract features from multi-modal multi-view sensor inputs. The tokens from different sensors are then fused in the transformer encoder. Three types of queries are then fed into the transformer decoder to predict waypoints, object density maps, and traffic rules, respectively. Finally, by recovering the traffic scene from the predicted object density map and using the tracker to forecast the future motion of other objects, a safety controller is applied to enhance the safety and efficiency of driving in complex traffic situations.
  • Figure 3: (a) Two cases of how our method predicts waypoints and recovers the traffic scene. Blue points denote predicted waypoints. The yellow rectangle represents the ego vehicle, and white/grey rectangles denote the current/future positions of detected objects. (b) Visualization of the attention weights between the object density map queries and the features from different views.
  • Figure 4: The driving preference varies when different safety factors are assigned to the safety controller. A 100% safety factor refers to the setting $\bar{s} = 2$ and $v_{max} = 6.5$, and a 150% safety factor refers to the setting $\bar{s} = 2 \times 150\%$ and $v_{max} = 6.5 / 150\%$. The Town05 Long benchmark with adversarial events is used here.
  • Figure 5: Four cases of how our method predicts waypoints and recovers the traffic scene. Blue points denote predicted waypoints. The yellow rectangle represents the ego vehicle, and white/grey rectangles denote the current/future positions of detected objects.
  • ...and 2 more figures
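
The safety-factor settings quoted in the Figure 4 caption can be worked out numerically. The sketch below is only an illustration of that arithmetic: the function name `scale_safety_settings` and the constant names are hypothetical, not taken from the paper or its code. The caption does not define $\bar{s}$ or give units, so the two values are treated here simply as the numbers being scaled, with $\bar{s}$ multiplied by the factor and $v_{max}$ divided by it.

```python
def scale_safety_settings(base_s_bar, base_v_max, safety_factor):
    """Scale the two controller settings by a safety factor (1.0 means 100%).

    Hypothetical helper: follows the Figure 4 caption, where s_bar is
    multiplied by the factor and v_max is divided by it.
    """
    if safety_factor <= 0:
        raise ValueError("safety_factor must be positive")
    s_bar = base_s_bar * safety_factor
    v_max = base_v_max / safety_factor
    return s_bar, v_max


# Baseline (100%) values as quoted in the Figure 4 caption.
BASE_S_BAR = 2.0
BASE_V_MAX = 6.5

for factor in (1.0, 1.5):
    s_bar, v_max = scale_safety_settings(BASE_S_BAR, BASE_V_MAX, factor)
    print(f"{factor:.0%}: s_bar={s_bar:.2f}, v_max={v_max:.2f}")
# prints:
# 100%: s_bar=2.00, v_max=6.50
# 150%: s_bar=3.00, v_max=4.33
```

Under these assumptions, a 150% safety factor raises $\bar{s}$ from 2 to 3 and lowers $v_{max}$ from 6.5 to roughly 4.33, which matches the caption's description of a more conservative setting.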