Language Conditioned Traffic Generation

Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, Philipp Kraehenbuehl

TL;DR

This work introduces LCTGen, a first-of-its-kind framework for language-conditioned traffic generation that uses a large language model–driven Interpreter, a map Retrieval module, and a transformer-based Generator to produce realistic, controllable traffic scenes from natural language descriptions. It leverages scenario-only driving data with an Encoder to create paired representations, enabling end-to-end learning without language–traffic data. Empirical results show substantial gains in scene realism (MMD, mADE, mFDE, SCR) and language alignment (human studies) over prior methods, and demonstrate useful applications in instructional traffic scenario editing and controllable policy evaluation. The approach holds practical promise for scalable, linguistically controllable traffic scenario generation in self-driving development and testing.
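The realism metrics named above include displacement errors. As a rough illustration only, the sketch below computes the standard average and final displacement errors (ADE, FDE) between one generated trajectory and one reference trajectory, then averages them over a set of pairs. The function names, the pairing of trajectories, and the averaging scheme are assumptions made for this example; the summary does not state how the paper defines mADE and mFDE in detail.

```python
import math

def displacement_errors(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points.

    ADE is the mean Euclidean distance over all timesteps;
    FDE is the Euclidean distance at the final timestep.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return sum(dists) / len(dists), dists[-1]

def mean_displacement_errors(pairs):
    """Average ADE and FDE over (pred, truth) trajectory pairs.

    Hypothetical helper: the averaging here is a plain mean over pairs,
    which may differ from the paper's protocol.
    """
    if not pairs:
        raise ValueError("need at least one trajectory pair")
    errors = [displacement_errors(pred, truth) for pred, truth in pairs]
    mean_ade = sum(ade for ade, _ in errors) / len(errors)
    mean_fde = sum(fde for _, fde in errors) / len(errors)
    return mean_ade, mean_fde

# Illustrative data, not taken from the paper.
generated = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(generated, reference)
```

Lower values indicate generated motion that stays closer to the reference; FDE isolates the error at the end of the horizon, where drift tends to be largest.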

Abstract

Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.

Paper Structure

This paper contains 43 sections, 11 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of our LCTGen model.
  • Figure 2: Example Interpreter input and output. We only show partial texts for brevity.
  • Figure 3: Architecture of our Generator model.
  • Figure 4: Qualitative results on text-conditioned generation.
  • Figure 5: Instructional editing on a real-world scenario. Refer to Supp.A for full prompts.
  • ...and 5 more figures