Interpretable Decision-Making for End-to-End Autonomous Driving

Mona Mirzaie, Bodo Rosenhahn

TL;DR

The paper addresses the opacity of end-to-end autonomous driving by introducing a diversity-based regularizer that yields sparse, localized activation maps, enabling interpretable decision-making. Integrated into a TCP-inspired framework (DTCP), the method improves interpretability and safety without ensembles or traffic-rule sub-tasks, achieving competitive or state-of-the-art route completion on CARLA benchmarks with a monocular camera. Key contributions include the diversity loss formulation, extensive ablations, and interpretability evaluations (IoU, GTC, SC, and saliency correlations) that link visual explanations to driving performance. The results suggest that promoting feature diversity enhances both transparency and practical effectiveness, supporting scalable deployment of safer end-to-end autonomous driving systems.

Abstract

Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: End-to-End Driving Framework: The TCP (Trajectory-guided Control Prediction) framework predicts control signals $\mathbf{a}^{traj,ctrl}$ from a front-camera image $\mathcal{I}$ and a set of measurement data $\mathcal{C}$ (navigational commands, velocity, target point). $\mathbf{F}^{traj}$ denotes the concatenated image and measurement encoder features, whereas $\mathbf{F}_{t-1}^{ctrl}$ represents measurement features fused with attention-weighted image features at the current step $t-1$ (dashed arrow). In the trajectory branch, waypoints are predicted using a GRU layer, while the control branch forecasts multi-step control actions, leveraging the trajectory branch. During training, two widely adopted loss functions, $\mathcal{L}_{\mathrm{traj}}$ and $\mathcal{L}_{\mathrm{ctrl}}$, are applied together with our proposed $\mathcal{L}_{\mathrm{div}}$ to minimize the difference between the predicted waypoints and actions and those provided by the expert. During inference, the control signals converted from the predicted trajectory by a PID controller, $\mathbf{a}^{traj}$, and those from the control branch, $\mathbf{a}^{ctrl}$, are aggregated by situational action fusion to form the final control actions.
  • Figure 2: EigenCam Visualizations in Various Challenging Scenarios. From left to right: original image, DTCP (ours), and reproduced TCP. TCP shows a uniform feature distribution, primarily focusing on the left and right due to crossing traffic participants. In contrast, our method captures more diverse feature representations, enhancing focus on regions critical for driving decisions.
  • Figure 3: Example of Failure Cases. From left to right: heatmap of our model, heatmap binary mask, intersection of the bounding box and heatmap binary mask, and heatmap of the reproduced TCP.
  • Figure 4: Comparison of Activation Maps. Left: heatmap of our model DTCP; right: heatmap of the reproduced TCP. Our model achieves superior performance by activating key regions relevant to driving decisions, thus enhancing interpretability.
  • Figure 5: Example of Generated Binary Mask. From left to right: heatmap with overlapping bounding boxes, binary mask derived from the heatmap, and binary mask of the bounding boxes.
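
The captions of Figures 3 and 5 describe comparing a binary mask derived from the heatmap with a binary mask of the bounding boxes, and the TL;DR lists IoU among the interpretability metrics. The following is a minimal sketch of how such a comparison could be computed, not the authors' implementation. The function names, the min-max normalization, the 0.5 threshold, and the `(x0, y0, x1, y1)` pixel box format are all assumptions made for illustration; the paper's actual thresholding procedure is not given in this section.

```python
import numpy as np


def heatmap_to_mask(heatmap, threshold=0.5):
    """Binarize a heatmap after min-max normalization to [0, 1].

    The threshold value is an assumption, not taken from the paper.
    """
    h = np.asarray(heatmap, dtype=np.float64)
    span = h.max() - h.min()
    if span == 0:
        # A constant heatmap carries no localization information.
        return np.zeros(h.shape, dtype=bool)
    normalized = (h - h.min()) / span
    return normalized >= threshold


def boxes_to_mask(boxes, shape):
    """Rasterize boxes given as (x0, y0, x1, y1) pixel corners.

    x1 and y1 are exclusive; this box format is an assumption.
    """
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask


def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of equal shape."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(mask_a, mask_b).sum()
    return float(intersection / union)


# Toy example: a 4x4 heatmap that is active in its top-left 2x2 corner,
# compared against one box covering the two left-most columns.
heatmap = np.zeros((4, 4))
heatmap[0:2, 0:2] = 1.0
heat_mask = heatmap_to_mask(heatmap)
box_mask = boxes_to_mask([(0, 0, 2, 4)], heatmap.shape)
iou = mask_iou(heat_mask, box_mask)
```

In this toy case the heatmap mask covers 4 pixels and the box mask covers 8, with the heatmap mask fully inside the box, so the IoU is 4/8 = 0.5. A higher value would indicate that the activated regions coincide more closely with the annotated objects.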