Explainable deep learning improves human mental models of self-driving cars

Eoin M. Kenny, Akshay Dharmavaram, Sang Uk Lee, Tung Phan-Minh, Shreyas Rajesh, Yunqing Hu, Laura Major, Momchil S. Tomov, Julie A. Shah

TL;DR

This work tackles the opacity of deep neural planners in autonomous driving by introducing the Concept-Wrapper Network (CW-Net), which grounds planner reasoning in human-interpretable concepts to produce causally faithful explanations. CW-Net preserves driving performance while enabling concept-based explanations that improve human drivers' mental models and situational awareness in both real-world and simulated settings. Across semi-naturalistic on-road tests and large online SAGAT studies, explanations improved prediction and understanding in surprising scenarios without harming routine performance, suggesting practical utility and potential regulatory relevance. The approach leverages a DriveIRL-based black-box planner augmented with a concept classifier and a new reward module, trained on large-scale driving data, with extensive evaluation demonstrating robustness, generalization, and accessibility for reproducibility.

Abstract

Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. The opacity of such black-box planners makes it challenging for the human behind the wheel to accurately anticipate when they will fail, with potentially catastrophic consequences. While research into interpreting these systems has surged, most of it is confined to simulations or toy setups due to the difficulty of real-world deployment, leaving the practical utility of such techniques unknown. Here, we introduce the Concept-Wrapper Network (CW-Net), a method for explaining the behavior of machine-learning-based planners by grounding their reasoning in human-interpretable concepts. We deploy CW-Net on a real self-driving car and show that the resulting explanations improve the human driver's mental model of the car, allowing them to better predict its behavior. To our knowledge, this is the first demonstration that explainable deep learning integrated into self-driving cars can be both understandable and useful in a realistic deployment setting. CW-Net accomplishes this level of intelligibility while providing explanations which are causally faithful and do not sacrifice driving performance. Overall, our study establishes a general pathway to interpretability for autonomous agents by way of concept-based explanations, which could help make them more transparent and safe.

Paper Structure

This paper contains 34 sections, 4 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Planner architecture, adapted with permission from Tomov et al. (2025). a. Autonomous vehicle stack. Sensory input is processed by the perception module to generate scene context $s$. The planning module processes $s$ to compute trajectory $\hat{\tau}$, which is followed by the control module. b. Black-box ML planner. Here, $s$ is fed to trajectory generator $G$, which produces candidate trajectories $\{\tau_1 \dots \tau_k\}$, and scene encoder $H$, which produces a scene embedding $\mathbf{h}$. These are fed into encoder $E$, which produces scene-trajectory embeddings $\mathbf{z}_i$ for each $\tau_i$, which are in turn fed into the reward layer $R$. $R$ computes a reward $r_i$ for each trajectory. The model is trained to output higher rewards for trajectories closer to the ground-truth human trajectories. c. CW-Net. Identical to b., except $\mathbf{z}_i$ is fed to a concept classifier $C$, which produces concept assignments $\mathbf{c}_i$. These are fed to a new reward layer $R'$ to produce rewards $r'_i$. In parallel, $\mathbf{c}_i$ is processed to generate the explanations. The reward layer is trained to prefer the same trajectories as the black-box ML planner, while the concept layer is supervised with ground-truth scenario labels. The weights of the faded components are frozen during training. CLOSE, "Close to another vehicle". ASV, "Approaching stopped vehicle". BIKE, "Close to cyclist".
  • Figure 2: Deployment setup. A safety driver, a support engineer, and a researcher were present. The safety driver drove the AV manually between road tests, engaged self-driving mode at the start of each test, monitored AV performance during the test, and took over in case of unsafe driving. The support engineer deployed CW-Net and set scenario destinations. The researcher directed testing. The dashboard included a map with overlaid object detections ($s$) from the perception module and the output trajectory ($\hat{\tau}$). Explanations $\mathbf{c}_{\hat{i}}$ from CW-Net were shown as percentages for easier interpretation [gigerenzer1995improve].
  • Figure 3: Results: a. The CLOSE concept activated when the car got stuck next to parked vehicles. The driver initially thought the pickup/drop-off area was the cause, but the explanation suggested that it was the nearby vehicles. When the driver engaged self-driving away from the parked vehicles, activation of the CLOSE concept decreased and the AV started moving again, counter to the driver's initial mental model and consistent with the explanations. Across tests, CLOSE correlated with speed and the intercept accurately predicted this event. b. The ASV concept activated when the car stopped next to a traffic cone. The driver initially thought the cone was the cause, but the explanation suggested that the AV was hallucinating a stopped vehicle. When we removed the cone in a counterfactual test, the same phantom braking and concept activation occurred. Across tests, ASV correlated with reductions in speed when spiking above 0.5 probability. c. The BIKE concept failed to activate in our first round of tests with a cyclist, but the car always stopped safely for the cyclist. After observing the explanation, the driver engaged self-driving from slower speeds in a second round of tests. Follow-up analyses revealed that the AV stopped for the cyclist due to backup safety mechanisms unrelated to CW-Net, which indicates that the driver's increased level of caution was appropriate. a-c. (Right column) Naturalistic scenarios from public-road deployment, analogous to the scenarios observed on the private track. We used the PEDESTRIAN concept instead of the BIKE concept as it behaved similarly and pedestrians were more frequently encountered on public roads. Likewise, we used pedestrians in place of the traffic cone in our ASV tests as no cones were encountered on the public roads. Standard error of the mean and 3-second rolling averages are shown in relevant plots.
  • Figure S1: Parallel architecture: This is identical to the black-box planner (Figure 1b; greyed out here), except the scene-trajectory embeddings are fed to the concept classifier $C$ in parallel to the (original) reward model $R$. The black-box part of the architecture (greyed out) is kept frozen. These concept classifications are then converted to probabilities (multiplied by 100 to give percentages) and presented to the user.
  • Figure S2: Study methodology. First, we collected a series of surprising situations that expert safety drivers encountered during their day-to-day work testing the AV, in which the explanations proved useful for understanding why the AV made certain decisions and how it would behave in alternative scenarios. During these situations, we recorded the driver's mental model through their think-aloud comments and behaviour. We then ran a mental model elicitation study with additional expert drivers and with non-experts online to verify these findings more robustly and to test whether mental model quality did indeed correlate with counterfactual predictive performance. Finally, in a large-scale SAGAT study, we used projection as a surrogate metric for mental model quality, since it perfectly matches the prior prediction tasks.
  • ...and 13 more figures
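
The data flow in the Figure 1c caption (embeddings $\mathbf{z}_i$, concept classifier $C$, new reward layer $R'$, explanations shown as percentages) can be illustrated with a minimal sketch. Everything beyond that data flow is an assumption made for illustration: the function names, the linear-plus-sigmoid form of $C$, the linear form of $R'$, and the choice of the highest-reward trajectory are hypothetical and are not taken from the paper.

```python
import math

# Concept names taken from the Figure 1 caption; the ordering is arbitrary.
CONCEPTS = ["CLOSE", "ASV", "BIKE"]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def concept_classifier(z, W_c, b_c):
    """C: map one scene-trajectory embedding z_i to concept probabilities c_i.

    Assumption: a single linear layer with independent sigmoids.
    """
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(W_c, b_c)]
    return [sigmoid(v) for v in logits]


def reward_layer(c, w_r, b_r):
    """R': map concept probabilities c_i to a scalar reward r'_i.

    Assumption: a linear layer over the concept vector.
    """
    return sum(w * p for w, p in zip(w_r, c)) + b_r


def cw_net(embeddings, W_c, b_c, w_r, b_r):
    """Score every candidate trajectory and explain the selected one."""
    concepts = [concept_classifier(z, W_c, b_c) for z in embeddings]
    rewards = [reward_layer(c, w_r, b_r) for c in concepts]
    # Assumption: the planner follows the highest-reward candidate.
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    # Concept probabilities of the selected trajectory, shown as percentages.
    explanation = {name: round(100 * p) for name, p in zip(CONCEPTS, concepts[best])}
    return best, rewards, explanation


if __name__ == "__main__":
    # Two hypothetical 2-dimensional embeddings standing in for the output of E.
    embeddings = [[0.0, 0.0], [2.0, -1.0]]
    W_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    b_c = [0.0, 0.0, 0.0]
    w_r = [-1.0, -1.0, -1.0]  # hypothetical: every active concept lowers the reward
    print(cw_net(embeddings, W_c, b_c, w_r, 0.0))
```

Because the reward in this sketch is computed only from the concept vector, the displayed percentages are the sole inputs to the trajectory choice, which is one way to read the caption's claim that the explanations reflect the planner's reasoning.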