LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers

Fabian Schmidt, Karol Fedurko, Markus Enzweiler, Abhinav Valada

TL;DR

LAD-Drive is a generative framework that structurally disentangles high-level intention from low-level spatial planning. On the LangAuto benchmark it achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions.

Abstract

While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models at https://github.com/iis-esslingen/lad-drive.
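
The contrast between a probabilistic meta-action distribution and a one-hot encoding can be illustrated with a minimal sketch. This is not the authors' implementation: the logits, the two-element ego state, and fusion by concatenation are all assumptions made for illustration, and the action abbreviations are borrowed from the Figure 4 caption.

```python
import numpy as np

# Meta-action labels as abbreviated in the Figure 4 caption.
META_ACTIONS = ["L", "CLL", "S", "LF", "CLR", "R"]

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def one_hot(probs):
    """Collapse a distribution to its most likely action only."""
    out = np.zeros_like(probs)
    out[np.argmax(probs)] = 1.0
    return out

def fuse(intent, ego_state):
    """Assumed fusion: plain concatenation of intent and kinematic state."""
    return np.concatenate([intent, ego_state])

# Hypothetical action-decoder logits: "L" and "S" are both plausible.
logits = np.array([1.2, -0.5, 1.0, 0.1, -1.0, -0.8])
probs = softmax(logits)
hard = one_hot(probs)

# Hypothetical ego state: speed in m/s and yaw rate in rad/s.
ego = np.array([5.0, 0.0])

cond_soft = fuse(probs, ego)  # keeps the belief that "S" is nearly as likely
cond_hard = fuse(hard, ego)   # records only the single most likely action

for name, p, h in zip(META_ACTIONS, probs, hard):
    print(f"{name:>3}: soft={p:.3f} hard={h:.0f}")
```

In this toy case the soft conditioning vector still carries substantial probability mass on "S", while the one-hot vector reports only "L", which is the kind of navigational uncertainty the abstract says is discarded.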

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: LAD-Drive disentangles high-level semantic reasoning from low-level spatial planning across three modules. First, the Language module uses a multimodal LLM (AD-MLLM) to synthesize sensor data and navigation instructions into contextualized hidden states. Second, the Action decoder infers a discrete meta-action distribution directly from these states. Finally, the Diffusion decoder utilizes this probabilistic conditioning to iteratively refine noisy priors into kinematically consistent, multimodal trajectories.
  • Figure 2: Detailed architecture of LAD-Drive. The framework utilizes an AD-MLLM backbone to process multi-view images, LiDAR point clouds, and navigation instructions into contextualized hidden states. A dedicated action decoder (green) infers a meta-action distribution from these states, which is fused with the vehicle's ego-status to form a joint state-intent representation. The diffusion decoder (orange) employs a shared transformer-based block to iteratively refine noisy trajectory anchors over two denoising steps, with each step's output updating the input for the next. This refinement process utilizes sequential multi-head cross-attention (MHCA) layers to ground trajectory generation in both the global visual-language context and the predicted probabilistic intent, outputting refined trajectories and their corresponding confidence scores.
  • Figure 3: Anchor trajectories generated via k-means clustering on training trajectories to provide motion priors for the initialization of the truncated diffusion process.
  • Figure 4: Qualitative comparison between LMDrive (top row) and LAD-Drive (bottom row). For LAD-Drive, the generated multimodal trajectories are visualized using a heat map color scheme, with paths ranging from white (low confidence) to red (high confidence). The predicted meta-action distribution from the action decoder is displayed in the top right of each LAD-Drive panel (L: Left, CLL: Change Lane Left, S: Straight, LF: Lane Follow, CLR: Change Lane Right, R: Right). Column 1 (Route Deviation): Despite the instruction to make a left turn, LMDrive turns right. LAD-Drive executes the correct intention while safely avoiding a cyclist. Column 2 (Off-Road & Collision Layout): LMDrive fails to interpret the road layout and instruction, driving off-road into a fence, whereas LAD-Drive successfully continues straight. Columns 3 & 4 (Collision Vehicle): In the third scenario, LMDrive collides head-on with a truck, while LAD-Drive safely navigates the lane. In the fourth scenario, LMDrive initiates an unsafe left lane change into traffic, ignoring the right-turn navigation instruction. In contrast, LAD-Drive effectively handles dynamic agents and remains safely on the road, demonstrating robustness even when the action decoder exhibits slight uncertainty.
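
The anchor construction in Figure 3 and the two-step refinement in Figure 2 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: `kmeans_anchors` and `truncated_refine` are hypothetical names, the training trajectories are synthetic, the noise scale is arbitrary, and the `denoise_step` callable is a toy stand-in for the learned, shared transformer block with cross-attention.

```python
import numpy as np

def kmeans_anchors(trajs, k, iters=20, seed=0):
    """Cluster trajectories of shape (N, T, 2) into k anchors of shape (k, T, 2)."""
    rng = np.random.default_rng(seed)
    flat = trajs.reshape(len(trajs), -1)
    # Initialize centers from k distinct training trajectories (copied).
    centers = flat[rng.choice(len(flat), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every trajectory to every center.
        dist = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = flat[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers.reshape(k, *trajs.shape[1:])

def truncated_refine(anchors, denoise_step, noise_scale=0.1, steps=2, seed=0):
    """Start from lightly noised anchors and apply only a few denoising steps."""
    rng = np.random.default_rng(seed)
    x = anchors + noise_scale * rng.standard_normal(anchors.shape)
    for t in reversed(range(steps)):
        x = denoise_step(x, t)  # each step's output is the next step's input
    return x

# Synthetic training set: 20 straight and 20 left-curving paths, 8 waypoints each.
s = np.linspace(0.0, 1.0, 8)
straight = np.stack([np.zeros_like(s), 10.0 * s], axis=-1)
left = np.stack([-5.0 * s**2, 10.0 * s], axis=-1)
rng = np.random.default_rng(1)
trajs = np.concatenate([np.tile(straight, (20, 1, 1)), np.tile(left, (20, 1, 1))])
trajs = trajs + 0.05 * rng.standard_normal(trajs.shape)

anchors = kmeans_anchors(trajs, k=2)

# Toy denoiser (assumption): halves the residual to the anchors at each step.
refined = truncated_refine(anchors, lambda x, t: 0.5 * (x + anchors))
print(anchors.shape, refined.shape)
```

Starting from clustered anchors rather than pure Gaussian noise is what lets the denoising loop be truncated to a small number of steps; in the real model the denoiser is also conditioned on the visual-language context and the predicted intent, which this sketch omits.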