Text-to-Drive: Diverse Driving Behavior Synthesis via Large Language Models

Phat Nguyen, Tsun-Hsuan Wang, Zhang-Wei Hong, Sertac Karaman, Daniela Rus

TL;DR

Text-to-Drive (T2D) presents a knowledge-driven framework that converts natural language descriptions of driving behaviors into diverse, executable policies for simulation. An LLM-based two-stage pipeline first generates behavior descriptions and then constructs low- to high-level state mappings and finite-state reward structures, enabling a driving policy to learn via multi-agent RL with primary and auxiliary rewards. The method preserves behavioral context across language, code, and policy, and demonstrates improved trajectory diversity across intersections, merges, and highways, while allowing human preferences through natural language interfaces. Limitations include the absence of data-driven simulators in training and the need for perception-grounded state abstraction, with future work to integrate data-driven environments and semantic maps.

Abstract

Generating varied scenarios through simulation is crucial for training and evaluating safety-critical systems, such as autonomous vehicles. Yet, the task of modeling the trajectories of other vehicles to simulate diverse and meaningful close interactions remains prohibitively costly. Adopting language descriptions to generate driving behaviors emerges as a promising strategy, offering a scalable and intuitive method for human operators to simulate a wide range of driving interactions. However, the scarcity of large-scale annotated language-trajectory data makes this approach challenging. To address this gap, we propose Text-to-Drive (T2D) to synthesize diverse driving behaviors via Large Language Models (LLMs). We introduce a knowledge-driven approach that operates in two stages. In the first stage, we employ the embedded knowledge of LLMs to generate diverse language descriptions of driving behaviors for a scene. In the second stage, we leverage the LLM's reasoning capabilities to synthesize these behaviors in simulation. At its core, T2D employs an LLM to construct a state chart that maps low-level states to high-level abstractions. This strategy aids in downstream tasks such as summarizing low-level observations, assessing policy alignment with the behavior description, and shaping the auxiliary reward, all without needing human supervision. With our knowledge-driven approach, we demonstrate that T2D generates more diverse trajectories than the baselines and offers a natural language interface that allows for interactive incorporation of human preference. Please check our website for more examples: https://text-to-drive.github.io/
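
To make the idea of mapping low-level states to high-level abstractions concrete, here is a minimal Python sketch. It is not the authors' code: the observation fields (`speed_mps`, `dist_to_intersection_m`, `gap_to_lead_m`), the abstract labels, and all thresholds are hypothetical, chosen only to illustrate how a translator of this kind could label each step and record the history of visited abstract states.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LowLevelState:
    # Hypothetical observation fields; the paper's actual state is not given here.
    speed_mps: float
    dist_to_intersection_m: float
    gap_to_lead_m: float


def translate(state: LowLevelState) -> str:
    """Map one low-level state to an abstract label (illustrative thresholds)."""
    if state.gap_to_lead_m < 5.0:
        return "tailgating"
    if state.dist_to_intersection_m < 10.0 and state.speed_mps < 0.5:
        return "stopped_at_intersection"
    if state.dist_to_intersection_m < 10.0:
        return "entering_intersection"
    return "cruising"


def visit_history(trajectory):
    """Return the abstract-state sequence (consecutive repeats collapsed)
    and a per-label count of how many steps were spent in each state."""
    labels = [translate(s) for s in trajectory]
    sequence = [lab for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1]]
    return sequence, Counter(labels)


trajectory = [
    LowLevelState(10.0, 50.0, 30.0),
    LowLevelState(8.0, 8.0, 30.0),
    LowLevelState(0.2, 5.0, 30.0),
    LowLevelState(0.1, 4.0, 30.0),
]
sequence, counts = visit_history(trajectory)
print(sequence)
print(counts)
```

A compact abstract sequence like this is the kind of summary an LLM could read in place of raw per-step observations.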

Paper Structure

This paper contains 21 sections, 2 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Given a scene description, T2D leverages Large Language Models to generate diverse descriptions of driving behaviors and then synthesizes them in simulation.
  • Figure 2: Overview. Left: First, an LLM generates diverse descriptions of driving behaviors, which can incorporate human preferences through a natural language interface. Middle: Next, an LLM generates a low-level state translator (LLST), primary function, and auxiliary function from a description of a driving behavior. The LLST translates low-level states to abstract states (see example in bottom middle block) and then records their state visit history (see example in bottom right block). The primary function gives a reward only when the vehicle exhibits the target behavior, using a finite-state machine for formal verification of behavior emergence (see example in bottom left block). The auxiliary function provides rewards for reaching intermediate states and can be iteratively updated. Right: Finally, we employ a standard multi-agent RL framework to train a driving policy using the primary and auxiliary functions as guidance.
  • Figure 3: Left: The auxiliary iterator LLM analyzes the policy after training to decide whether and how to adjust the auxiliary function based on the history of abstract state visits. Right: The LLM's reasoning process, in which it reads a high-level behavior sequence, analyzes it, and then provides an accurate summary of the low-level trajectories.
  • Figure 4: Diverse driving behaviors at an intersection.
  • Figure 5: Diverse highway driving and merging behaviors.
  • ...and 3 more figures
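
The primary and auxiliary functions described in the Figure 2 caption can be sketched in Python as follows. This is an illustrative assumption, not the generated code from the paper: the class `BehaviorFSM`, the functions `primary_reward` and `auxiliary_reward`, the abstract-state names, and the bonus values are all hypothetical. The finite-state machine accepts only when the target abstract states have been visited in order, so the primary reward is given only if the behavior actually emerges, while the auxiliary reward pays a small bonus for each intermediate state reached.

```python
class BehaviorFSM:
    """Accepts when the required abstract states are visited in order.
    Other states may occur in between (hypothetical design choice)."""

    def __init__(self, required_sequence):
        self.required = list(required_sequence)
        self.index = 0  # position of the next required state

    @property
    def accepted(self) -> bool:
        return self.index == len(self.required)

    def step(self, abstract_state: str) -> bool:
        if not self.accepted and abstract_state == self.required[self.index]:
            self.index += 1
        return self.accepted


def primary_reward(abstract_states, required_sequence) -> float:
    """Sparse reward: 1.0 only if the full target behavior was exhibited."""
    fsm = BehaviorFSM(required_sequence)
    for state in abstract_states:
        fsm.step(state)
    return 1.0 if fsm.accepted else 0.0


def auxiliary_reward(abstract_states, bonuses) -> float:
    """Dense shaping: a one-time bonus for each intermediate state reached."""
    seen = set()
    total = 0.0
    for state in abstract_states:
        if state in bonuses and state not in seen:
            total += bonuses[state]
            seen.add(state)
    return total


target = ["approach", "stop", "proceed"]
episode = ["approach", "stop", "wait", "proceed"]
print(primary_reward(episode, target))
print(auxiliary_reward(episode, {"approach": 0.1, "stop": 0.2}))
```

Under this sketch, iteratively updating the auxiliary function amounts to editing the `bonuses` dictionary between training rounds while the FSM-based primary reward stays fixed.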