LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

Wei-Jer Chang, Wei Zhan, Masayoshi Tomizuka, Manmohan Chandraker, Francesco Pittaluga

TL;DR

LangTraj addresses scalable autonomous vehicle testing by enabling controllable simulation via natural language prompts. It introduces a language-conditioned diffusion model for joint multi-agent trajectories and a large interactive dataset, InterDrive, to learn semantics of agent interactions, complemented by a novel closed-loop training strategy to reduce compounding errors. The approach achieves realism and language controllability on real-world data and supports safety-critical, text-guided scenario generation, outperforming LLM-guided baselines in key settings. Together, LangTraj offers a flexible, scalable framework for counterfactual AV testing and language-based scenario design.

Abstract

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Open Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing. Project Website: https://langtraj.github.io/

Paper Structure

This paper contains 43 sections, 14 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Unconditioned (top), Text-Conditioned (mid), and Safety-Critical + Text-Conditioned (bot) Simulations by LangTraj. Adversarial collision guidance is applied in conjunction with text conditioning to generate the safety-critical scenarios shown in the bottom row, where the dark blue car serves as the adversarial agent. Language annotations are from the test set of InterDrive.
  • Figure 2: Overview of InterDrive dataset. InterDrive captures nuanced agent-agent interactions in real-world driving contexts. It includes human-labeled traffic interaction annotations from Waymo Motion and NuPlan datasets, along with single-agent behavioral labels generated through heuristic annotations, offering a comprehensive view of agent actions and interactions in diverse traffic scenarios. The top part of the figure shows the annotation processes for the human and heuristic annotations. The bottom part of the figure shows the counts of each interaction and interaction subtypes in InterDrive.
  • Figure 3: Overview of LangTraj. We introduce LangTraj, a novel language-controlled diffusion-based model for trajectory simulation that incorporates HD maps, agent histories, and text descriptions, enabling behaviorally nuanced trajectory generation.
  • Figure 4: Illustration of Closed-loop Training of diffusion models. The figure demonstrates the procedure of training diffusion models in a closed-loop setting. First, the model generates multiple denoised trajectory candidates. The candidate closest to the ground truth is then selected and executed, enabling the model to experience its own distribution during training.
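
The candidate-selection step described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `select_closest_candidate` and `closed_loop_step`, the `replan_steps` parameter, the array layout `(K, A, T, 2)`, and the use of average displacement error as the notion of "closest" are all assumptions made here, since the caption does not specify them. The candidates are toy arrays standing in for denoised model outputs.

```python
import numpy as np


def select_closest_candidate(candidates: np.ndarray, ground_truth: np.ndarray) -> int:
    """Pick the candidate scene closest to the logged scene.

    candidates:   (K, A, T, 2) array, K candidate joint trajectories for A agents.
    ground_truth: (A, T, 2) array with the logged (x, y) positions.
    Distance is the average displacement error over agents and time
    (an assumed metric; the caption only says "closest").
    """
    # Euclidean error per candidate, agent, and timestep: shape (K, A, T)
    errors = np.linalg.norm(candidates - ground_truth[None], axis=-1)
    # Average over agents and timesteps to get one score per candidate: shape (K,)
    scores = errors.reshape(errors.shape[0], -1).mean(axis=1)
    return int(np.argmin(scores))


def closed_loop_step(candidates, ground_truth, replan_steps):
    """Select the closest candidate and return the prefix that would be executed."""
    best = select_closest_candidate(candidates, ground_truth)
    # Only the first `replan_steps` timesteps are rolled out before re-planning
    # (hypothetical parameter, not taken from the source).
    executed = candidates[best, :, :replan_steps]
    return best, executed


# Toy scene: agent 0 drives along x, agent 1 drives along y, 5 timesteps each.
xs, zeros = np.arange(5.0), np.zeros(5)
gt = np.stack([np.stack([xs, zeros], axis=-1), np.stack([zeros, xs], axis=-1)])

# Three stand-in "denoised" candidates: the logged scene shifted by a constant offset.
candidates = np.stack([gt + 2.0, gt + 0.1, gt - 1.0])

best, executed = closed_loop_step(candidates, gt, replan_steps=2)
print(best, executed.shape)
```

In an actual training loop, the executed prefix would update the agent histories fed to the next denoising pass, which is how the model comes to see states drawn from its own rollouts rather than only from logged data.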