Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent

Yuxiao Chen, Sander Tonkens, Marco Pavone

TL;DR

CTT introduces an interpretable, tokenized latent framework for traffic prediction that jointly models accurate trajectories and semantically meaningful scene modes, namely agent2lane (a2l) and agent2agent (a2a) modes. By supervising the latent directly with ground-truth scene modes identified from driving logs and applying consistency losses to non-GT modes, it avoids mode collapse and enables diverse, controllable predictions, while its tokenized scene representation keeps it compatible with LLMs. The architecture fuses Transformer encoders with GNNs via Custom Edge Embedding to achieve equivariance and handle tokenized edges, and employs an energy-based, importance-sampled joint scene-mode predictor to keep multimodal outputs tractable yet expressive. Empirical results on nuScenes, nuPlan, and the Waymo Open Dataset show state-of-the-art accuracy, strong scene consistency, and robust cross-dataset generalization, with practical benefits for language-guided simulation and planning in autonomous driving.

Abstract

Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AVs), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. With the recent advent of large language models (LLMs), LLM compatibility has become an additional desirable feature for traffic models. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). CTT's most distinctive feature is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while outperforming the state of the art (SOTA) in prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.

Paper Structure

This paper contains 12 sections, 8 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Model architecture of CTT, where the encoder predicts the Scene Mode (SM) consisting of agent2lane (a2l) and agent2agent (a2a) modes, and the decoder generates trajectory predictions conditioned on the SM samples. The ground truth SM is identified directly from the driving log and thus decouples the encoder and decoder training (except for the shared context tensors).
  • Figure 2: Illustration of integrating CTT with GPT-4. The process starts with the perception module providing the scene description, including the road geometry, relevant agents, and relevant lane segments. CTT then provides several candidate scene modes, which specify the agent2lane (a2l) and agent2agent (a2a) modes. In this particular scene, the red ego vehicle may choose to yield to the ambulance (CCW) or not yield (CW); potential lane changes are specified by a2l modes. We then query GPT for suggestions, and GPT initially suggests a left lane change to yield to the ambulance while moving forward (SM (2)). However, after checking with CTT, the probability of SM (2) turns out to be very low, suggesting that human drivers tend to avoid a left lane change in this situation. Eventually GPT takes the feedback from CTT and suggests that the ego slow down and maintain its lane, i.e., SM (3). The common-sense reasoning of GPT helped eliminate SM (1), whereas the "expert" driving knowledge of CTT helped eliminate SM (2).
  • Figure 3: Node variables (solid) and edge variables (transparent) with their axes (T: temporal, A: agent, and L: lane). GNN message passing (cyan dashed arrow), cross-attention (dashed red arrows) and self-attention (solid red arrows) are intertwined.
  • Figure 4: Predictions under different scene modes. The ego vehicle is in red and all other agents are in green. (a1) and (a2) show the impact of lane modes and (b1) and (b2) show the impact of homotopies, where the two homotopies correspond to the ego yielding or not yielding to the green merging vehicle.
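
The captions above describe a2a modes as homotopies labeled CW or CCW and note that the ground-truth scene mode is read directly from the driving log. The following is a minimal sketch of how such a label could be derived from two logged trajectories, not the authors' implementation. It assumes the homotopy is decided by the accumulated rotation of the bearing from one agent to the other; the function names, the "S" label for pairs with little relative rotation, and the 30-degree threshold are all hypothetical.

```python
import math


def relative_winding(traj_a, traj_b):
    """Accumulated change (radians) of the bearing from agent a to agent b.

    traj_a and traj_b are equal-length sequences of (x, y) positions
    sampled at the same timestamps.
    """
    total = 0.0
    prev = None
    for (xa, ya), (xb, yb) in zip(traj_a, traj_b):
        bearing = math.atan2(yb - ya, xb - xa)
        if prev is not None:
            # Wrap each increment into [-pi, pi) so crossing the +/-pi
            # branch cut does not register as a full turn.
            delta = (bearing - prev + math.pi) % (2 * math.pi) - math.pi
            total += delta
        prev = bearing
    return total


def a2a_mode(traj_a, traj_b, threshold=math.pi / 6):
    """Label the pair as "CCW", "CW", or "S" (assumed label set and threshold)."""
    winding = relative_winding(traj_a, traj_b)
    if winding > threshold:
        return "CCW"
    if winding < -threshold:
        return "CW"
    return "S"


# Ego drives straight along the x-axis past a stationary agent.
ego = [(float(t), 0.0) for t in range(11)]
parked_left = [(5.0, 2.0)] * 11
parked_right = [(5.0, -2.0)] * 11
# A second agent driving in parallel, three meters to the left.
parallel = [(float(t), 3.0) for t in range(11)]

mode_left = a2a_mode(ego, parked_left)
mode_right = a2a_mode(ego, parked_right)
mode_parallel = a2a_mode(ego, parallel)
```

Under this sign convention, passing an agent on the ego's left yields a counter-clockwise label and passing one on its right yields a clockwise label. Because the label is a deterministic function of the log, it can serve as a supervised classification target, which is the property the Figure 1 caption relies on.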

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3