
Large Trajectory Models are Scalable Motion Predictors and Planners

Qiao Sun, Shiduo Zhang, Danjiao Ma, Jingzhe Shi, Derun Li, Simian Luo, Yu Wang, Ningyi Xu, Guangzhi Cao, Hang Zhao

TL;DR

This work presents State Transformer (STR), a scalable trajectory model that unifies motion prediction and motion planning as a single conditional sequence modeling problem by arranging context, past trajectories, and future directives into one sequence. Using a GPT-2–style causal Transformer backbone, STR introduces Context, Proposals, Key Points, and Future States, with a diffusion-based Key Points decoder to capture multimodality, and demonstrates strong scaling behavior on NuPlan and WOMD datasets. The approach achieves state-of-the-art or competitive results, shows robust generalization to unseen map topologies without extra high-level annotations, and reveals scaling laws similar to large language models, suggesting broad opportunities to leverage language-model architectures for autonomous driving tasks. Overall, STR offers a concise, adaptable framework that enables efficient learning, long-horizon reasoning, and cross-domain learning for planning and prediction in complex road environments.

Abstract

Motion prediction and planning are vital tasks in autonomous driving, and recent efforts have shifted to machine learning-based approaches. The challenges include understanding diverse road topologies, reasoning about traffic dynamics over a long time horizon, interpreting heterogeneous behaviors, and generating policies in a large continuous state space. Inspired by the success of large language models in addressing similar complexities through model scaling, we introduce a scalable trajectory model called State Transformer (STR). STR reformulates the motion prediction and motion planning problems by arranging observations, states, and actions into one unified sequence modeling task. Our approach unites trajectory generation problems with other sequence modeling problems, powering rapid iterations with breakthroughs in neighboring domains such as language modeling. Remarkably, experimental results reveal that large trajectory models (LTMs), such as STR, adhere to the scaling laws by presenting outstanding adaptability and learning efficiency. Qualitative results further demonstrate that LTMs are capable of making plausible predictions in scenarios that diverge significantly from the training data distribution. LTMs also learn to perform complex reasoning for long-term planning, without explicit loss designs or costly high-level annotations.

Paper Structure (46 sections, 6 equations, 8 figures, 8 tables)

This paper contains 46 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The architecture of STR. There are four components in the sequence, namely Context, Proposal, Key Points, and Future States. Each part is encoded by its corresponding encoder. The causal transformer backbone, the GPT-2 model in our experiments, learns representations on the embedding level. The Proposal and Key Points are two optional components. A full generation process of STR is as follows. (i) STR selects Top K Proposals indicating future directions. (ii) STR generates a set of Key Points consecutively. (iii) STR generates the future states.
  • Figure 2: These two figures demonstrate the substantial scalability of STR, illustrating the scaling laws for training LTMs. The left figure reveals that LTMs scale smoothly with the size of the training dataset. When training is not constrained by the size of the dataset, larger trajectory models tend to converge to a lower evaluation loss. The right figure shows that larger trajectory models converge faster than their smaller counterparts, indicating superior data efficiency.
  • Figure 3: Qualitative analysis of trajectory models of different scales. The route given for each scenario is marked as green roadblocks. The ego vehicle to plan for is marked as the dark blue box. All other road users are marked as green boxes drawn with their given sizes and yaw angles. The planning results are marked as larger circles, in orange for larger models and in purple for smaller models. These circles are sampled once per second from a trajectory of 8 seconds in total.
  • Figure 4: The distribution of the scenarios with each scenario tag in the training set.
  • Figure 5: Visualization of rasters in bird's-eye view. The upper row shows the rasters at high resolution, while the lower row shows the same scenarios at low resolution. (a) Start turning left. (b) Traversing intersection. (c) Passing roundabout. (d) Traversing pickup drop-off.
  • ...and 3 more figures
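
The sequence layout described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_sequence` and `causal_mask`, the embedding width, and the token counts per segment are all assumptions chosen for illustration, and random vectors stand in for the outputs of the per-segment encoders. The sketch only shows how the four segments (Context, optional Proposal, optional Key Points, Future States) could be concatenated into one sequence, and how a causal mask restricts each position to earlier positions, as in a GPT-2–style backbone.

```python
import numpy as np

def build_sequence(context, future_states, proposal=None, key_points=None):
    """Concatenate segment embeddings in the order named in the Figure 1 caption.

    Each argument is an array of shape (num_tokens, dim). Proposal and
    Key Points are optional, so passing None skips that segment.
    Returns the joined sequence and the [start, end) span of each segment.
    (Hypothetical helper, not part of any released STR code.)
    """
    named = [("context", context), ("proposal", proposal),
             ("key_points", key_points), ("future_states", future_states)]
    present = [(name, seg) for name, seg in named if seg is not None]
    tokens = np.concatenate([seg for _, seg in present], axis=0)

    spans, start = {}, 0
    for name, seg in present:
        spans[name] = (start, start + seg.shape[0])
        start += seg.shape[0]
    return tokens, spans

def causal_mask(num_tokens):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((num_tokens, num_tokens), dtype=bool))

# Toy stand-ins for encoder outputs; all sizes below are assumed.
rng = np.random.default_rng(0)
dim = 8
context = rng.standard_normal((4, dim))
proposal = rng.standard_normal((1, dim))
key_points = rng.standard_normal((3, dim))
future_states = rng.standard_normal((6, dim))

tokens, spans = build_sequence(context, future_states, proposal, key_points)
mask = causal_mask(tokens.shape[0])
```

Because Future States come last in the sequence, their positions can attend to the Context, the Proposal, and the Key Points under this mask, which matches the generation order (i) to (iii) in the caption. Omitting the two optional segments leaves only Context followed by Future States.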