DRIV-EX: Counterfactual Explanations for Driving LLMs

Amaia Cardiel; Eloi Zablocki; Elias Ramzi; Eric Gaussier

TL;DR

DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision, successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.

Abstract

Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they bias a controlled decoding process that re-generates the original scene description. This approach steers the generation toward the counterfactual target while guaranteeing linguistic fluency, domain validity, and proximity to the original input, all of which are essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.
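To make the biased decoding step described in the abstract more concrete, here is a minimal single-step sketch in plain Python. It is not the authors' implementation: the bias formulas (`embedding_bias` as a negative squared distance, `proximity_bias` as a fixed bonus for the original token), the weights, and the toy vocabulary are all assumptions chosen for illustration, since the paper's exact bias equations are not reproduced on this page.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def embedding_bias(e, vocab, weight):
    """ASSUMED form: tokens whose embeddings lie closer to the optimized
    embedding e receive a larger (less negative) bias."""
    return [-weight * sum((a - b) ** 2 for a, b in zip(e, v)) for v in vocab]

def proximity_bias(orig_token, vocab_size, weight):
    """ASSUMED form: a fixed bonus for re-generating the original token."""
    return [weight if i == orig_token else 0.0 for i in range(vocab_size)]

def biased_distribution(logits, *biases):
    """Add any number of bias vectors to the fluency-model logits,
    then normalize into a next-token distribution."""
    combined = [l + sum(b[i] for b in biases) for i, l in enumerate(logits)]
    return softmax(combined)

# Toy setup (hypothetical values): 3 tokens with 2-d embeddings.
vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fluency_logits = [2.0, 1.0, 0.0]   # the fluency model alone prefers token 0
optimized_e = [0.9, 0.1]           # the optimized embedding sits near token 1

p_plain = biased_distribution(fluency_logits)
p_biased = biased_distribution(
    fluency_logits,
    embedding_bias(optimized_e, vocab, weight=4.0),
    proximity_bias(orig_token=0, vocab_size=len(vocab), weight=1.0),
)
```

In this toy case the unbiased distribution favors token 0, while the biased one favors token 1, the token nearest to the optimized embedding. In a full decoder the same combination would be applied at every position of the autoregressive loop.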

Paper Structure (46 sections, 8 equations, 15 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Method: DRIV-EX
Problem and method overview
Formalization.
High-level method idea.
Discrete counterfactual optimization via straight-through embeddings
Continuous relaxation.
Decision-driven gradient update.
Projection-based regularization and evaluation
Fluency regularization with biased autoregressive decoding.
Input proximity regularization.
Best candidate selection.
Experiments
Experimental Protocol
...and 31 more sections

Figures (15)

Figure 1: Overview of DRIV-EX counterfactual generation. The LLM acts as the planner for the ego vehicle (in green). Given an initial driving scenario where the planner behaves safely (top row), our method automatically identifies a minimal semantic perturbation to the scene description (such as slightly altering the position or speed of surrounding vehicles) that forces the model into a dangerous failure mode (bottom row). By uncovering these decision boundaries, DRIV-EX exposes latent biases and evaluates the robustness of driving agents against critical edge cases.
Figure 2: Forward and backward passes, and bias computation. In the forward pass (black), the continuous soft embeddings $\mathbf{e}$ are projected onto their nearest discrete neighbors in the vocabulary to obtain tokens $\mathbf{x}$, following Eq. \ref{eq:projection}. These tokens are then processed by the model $\mathcal{M}$ to compute the probability $P_{\mathcal{M}}(y^*_T \mid \mathbf{y}_{<T}, \mathbf{x})$. In the backward pass (orange), the model calculates the loss relative to the desired target token $y^*_T$ (Eq. \ref{eq:decision_loss}). Gradients ($\nabla$) are backpropagated from the target decision and, using a straight-through (s-t) estimator, bypass the discrete projection to update the continuous embeddings $\mathbf{e}$ (Eq. \ref{eq:gradient_step}). Finally, the updated embeddings are converted into vocabulary bias terms ($\mathcal{B}$, in pink) to guide subsequent regularization (following Eq. \ref{eq:voc_penalization}--\ref{eq:bias}).
Figure 3: Regularized autoregressive decoding. During the regularization phase, the vocabulary bias terms $\mathcal{B}$ and $\mathcal{B}'$ (derived from optimized embeddings and $\mathbf{x}^o$) are added to the logits $\mathbf{l}$ of a fluency model $\mathcal{F}$. This combined signal biases the autoregressive decoding, following Eq. \ref{eq:biased_sampling}, to generate a new candidate sequence ($x_1, x_2, x_3$) that incorporates the decision-change signal while maintaining fluency and input proximity.
Figure 4: Histogram of the number of token changes (%) per input position for successful counterfactuals (Llama3) that flip a 'Keep lane' decision to a collision-inducing 'Right lane change' ($n=33$). Peaks indicate the tokens most critical for the decision flip. 'sv' stands for 'surrounding vehicle'; 'vx/vy' and 'ax/ay' denote velocity and acceleration.
Figure 5: Lateral drift of ground truth trajectories per lane change class. In the coordinate system of the text templates, positive coordinate values correspond to left drifts with respect to the initial ego state, while negative values correspond to right drifts.
...and 10 more figures
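
The forward and backward passes described in the Figure 2 caption can be illustrated with a small straight-through sketch in plain Python. This is a toy under stated assumptions, not the DRIV-EX code: the planner is replaced by a hypothetical linear classifier `W`, the sequence is a single embedding, and all names (`nearest_token`, `straight_through_step`, `optimize`) are introduced here for illustration only.

```python
import math

def nearest_token(e, vocab):
    """Index of the vocabulary embedding closest to e (squared L2 distance)."""
    return min(range(len(vocab)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(e, vocab[i])))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def loss_and_grad(x, W, target):
    """Toy stand-in for the planner (ASSUMPTION): logits = W @ x and
    loss = -log p[target]. Returns the loss and its gradient w.r.t. x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    p = softmax(logits)
    dlogits = [pi - (1.0 if k == target else 0.0) for k, pi in enumerate(p)]
    grad = [sum(W[k][j] * dlogits[k] for k in range(len(W)))
            for j in range(len(x))]
    return -math.log(p[target]), grad

def straight_through_step(e, vocab, W, target, lr):
    """Forward: project e onto its nearest token embedding and evaluate the
    model there. Backward: apply the gradient computed at the projected
    point directly to e, as if the projection were the identity."""
    tok = nearest_token(e, vocab)
    loss, grad = loss_and_grad(vocab[tok], W, target)
    new_e = [ei - lr * gi for ei, gi in zip(e, grad)]
    return new_e, tok, loss

def optimize(e, vocab, W, target, lr, steps):
    """Repeat straight-through steps, recording the projected token and loss."""
    tokens, losses = [], []
    for _ in range(steps):
        e, tok, loss = straight_through_step(e, vocab, W, target, lr)
        tokens.append(tok)
        losses.append(loss)
    return tokens, losses

# Hypothetical toy problem: token 0 supports decision 0, token 1 decision 1.
vocab = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0], [0.0, 2.0]]
tokens, losses = optimize([0.9, 0.1], vocab, W, target=1, lr=0.1, steps=5)
```

Starting near token 0, the continuous embedding drifts toward token 1 until the projection switches, at which point the loss for the target decision drops. The discrete token only changes once the accumulated updates cross the nearest-neighbor boundary, which is why the continuous embeddings are kept between steps.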
