Textual Explanations for Self-Driving Vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata

TL;DR

This paper tackles the opacity of end-to-end self-driving systems by proposing a grounded, introspective textual explanation framework. It combines a visual-attention-driven vehicle controller with a textual explanation generator. The two are linked through one of two attention-alignment schemes, strongly aligned attention (SAA) or weakly aligned attention (WAA), so that natural-language justifications are grounded in the image regions the controller attends to. The authors introduce the BDD-X dataset, over 77 hours of driving video with time-stamped action descriptions and justifications, to evaluate both the driving decisions and the quality of the explanations. Empirical results show that attention grounding improves explanation plausibility and that introspective explanations, particularly with weak alignment, agree better with human rationales. The work provides a practical benchmark for future research on grounded textual explanations for real-time control systems.

Abstract

Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i.e., acceleration and change of course. The controller's attention identifies image regions that potentially influence the network's output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving.
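The difference between the two alignment schemes can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the attention maps are aligned, so the KL-divergence penalty, the 2x2 attention grid, the score values, and the names `controller_attn`, `explainer_attn`, and `lambda_a` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into an attention map that sums to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical attention over a flattened 2x2 grid of image regions.
controller_attn = softmax([2.0, 0.1, 0.1, 0.1])  # controller's map
explainer_attn = softmax([0.1, 0.1, 2.0, 0.1])   # explainer's own map

# Strong alignment: the explanation model reuses the controller's map as-is.
strong_attn = list(controller_attn)

# Weak alignment: the explainer keeps its own map but pays a penalty
# (KL divergence here, an assumed choice) that grows as its map drifts
# away from the controller's.
lambda_a = 10.0  # hypothetical weight on the alignment term
weak_penalty = lambda_a * kl_divergence(controller_attn, explainer_attn)
```

Under these assumptions, strong alignment leaves the explanation model no freedom in where it looks, whereas weak alignment trades that freedom against the penalty weight: a larger `lambda_a` pulls the explainer's map closer to the controller's.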

Paper Structure

This paper contains 12 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Our model predicts the vehicle's control commands, i.e., an acceleration and a change of course, at each timestep, while an explanation model generates a natural language explanation of the rationales, e.g., "The car is driving forward because there are no other cars in its lane", and a visual explanation in the form of attention; the attended regions directly influence the textual explanation generation process.
  • Figure 2: The vehicle controller generates spatial attention maps $\alpha^c$ for each frame and predicts the acceleration and change of course ($\hat{a}_t, \hat{c}_t$) that condition the explanation. The explanation generator predicts temporal attention across frames ($\beta$) and a spatial attention in each frame ($\alpha^j$). SAA uses $\alpha^c$ directly, while WAA enforces a loss between $\alpha^j$ and $\alpha^c$.
  • Figure 3: (A) Examples of input frames and the corresponding human-annotated action description and justification of how a driving decision was made. For visualization, we sample one frame every two seconds. (B) BDD-X dataset details: over 77 hours of driving with time-stamped human annotations for action descriptions and justifications.
  • Figure 4: The vehicle controller's attention maps for four values of the entropy regularization coefficient, $\lambda_{c} \in \{0, 10, 100, 1000\}$. Red regions indicate where the model pays more attention. Higher values of $\lambda_{c}$ make the attention maps sparser. We observe that sparser attention maps improve the performance of generating textual explanations, while control performance is slightly degraded.
  • Figure 5: Example descriptions and explanations generated by our model compared to human annotations. We provide (top row) input raw images and attention maps by (from the 2nd row) vehicle controller, textual explanation generator, and rationalization model (Note: ($\lambda_{c}, \lambda_{a}$) = (100,10) and the synthetic separator token is replaced by '+').
  • ...and 3 more figures
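
The Figure 4 caption ties a larger entropy regularization coefficient to sparser attention maps. A minimal sketch of why that follows, assuming the regularizer simply adds the entropy of the attention map to the control loss (the caption does not give the exact form; `regularized_loss` and the example maps are hypothetical):

```python
import math

def entropy(attn, eps=1e-12):
    """Shannon entropy of an attention map (lower means sparser)."""
    return -sum(a * math.log(a + eps) for a in attn)

# Two hypothetical attention maps over four image regions.
uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]   # attention focused on one region

def regularized_loss(control_loss, attn, lambda_c):
    """Assumed form: control loss plus lambda_c times attention entropy."""
    return control_loss + lambda_c * entropy(attn)

# With the same control loss, a larger lambda_c widens the gap in favor
# of the peaked (sparse) map.
for lambda_c in (0, 10, 100, 1000):
    gap = (regularized_loss(1.0, uniform, lambda_c)
           - regularized_loss(1.0, peaked, lambda_c))
    print(lambda_c, round(gap, 3))
```

Because a spread-out map has higher entropy than a focused one, any positive `lambda_c` penalizes it more, and the penalty scales linearly with `lambda_c`. This is consistent with the caption's observation, though it does not by itself explain the reported effect on control performance.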