TSDiT: Traffic Scene Diffusion Models With Transformers

Chen Yang, Tianyu Shi

TL;DR

TSDiT tackles realistic traffic-scene trajectory generation by fusing diffusion models with Transformer-based encoders in a world-centric framework. The model first learns an action latent with a DDPM built from Diffusion with Transformers (DiT) blocks, then encodes agent, history, and HD-map features with dedicated Transformers (Other Agent Former, HD Map Former, Embedding Block) before decoding future trajectories with a trajectory decoder. Key contributions include the world-centric input representation, three specialized Transformer modules for inter-agent and map interactions, and a diffusion-based action latent that yields diverse trajectories. Evaluated on Waymo data, the model reports competitive ADE/FDE and strong qualitative behavior, especially for smooth turning. The approach offers a scalable, realistic pathway to traffic-scene generation for autonomous driving simulators and navigation systems.

Abstract

In this paper, we introduce a novel approach to trajectory generation for autonomous driving that combines the strengths of diffusion models and Transformers. First, we efficiently preprocess the historical trajectory data and generate an action latent using a diffusion model with DiT (Diffusion with Transformers) blocks, which increases scene diversity and the stochasticity of agent actions. Then, we combine the action latent, historical trajectories, and HD map features and feed them into different Transformer blocks. Finally, we use a trajectory decoder to generate the future trajectories of agents in the traffic scene. The method exhibits superior performance in generating smooth turning trajectories, enhancing the model's capability to fit complex steering patterns. The experimental results demonstrate the effectiveness of our method in producing realistic and diverse trajectories, showcasing its potential for application in autonomous vehicle navigation systems.
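
To make the diffusion step concrete, here is a minimal sketch of the standard DDPM forward (noising) process applied to an action-latent vector. It is not taken from the paper: the linear noise schedule, the number of steps, the latent size, and the names `make_alpha_bar` and `q_sample` are all assumptions chosen for illustration. The denoising network built from DiT blocks is omitted entirely.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) and return the noise used."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
action_latent = rng.standard_normal(16)  # hypothetical 16-dimensional action latent
noisy_latent, noise = q_sample(action_latent, t=500, alpha_bar=alpha_bar, rng=rng)
```

During training, a denoiser would be asked to predict `noise` from `noisy_latent` and `t`; sampling different noise at generation time is what gives the latent, and therefore the decoded trajectories, their diversity.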

Paper Structure

This paper contains 24 sections, 15 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: World-centred and agent-centred models. (a) World-centred model: a "world centre" is first defined, and the map information and the positions of the agents in the traffic scene are then transformed from absolute coordinates to Frenet coordinates with the "world centre" as the origin. Within one traffic scenario, all features share the same coordinate origin, so the world-centred model can output the future trajectories of all agents in a single inference pass. (b) Agent-centred model: the map information and other agents' positions in the traffic scene are transformed from absolute coordinates to Frenet coordinates with each agent as the origin, so a scene with $N$ agents requires building $N$ sets of input features and running inference $N$ times to obtain all trajectories.
  • Figure 2: Overview of Diffusion with Transformers
  • Figure 3: Overview of World-centric Encoders, which consist of four parts: (a) Embedding Blocks encode the position and features of multimodal scene information; (b) Spatial and Temporal Attention encodes the temporal and spatial information of predicted agents; (c) Scene Formers model information from other modalities in the traffic scene along the temporal and spatial dimensions; (d) Fusion Blocks fuse the previous features and output the results to the trajectory encoder.
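
The difference between the two input representations in Figure 1 can be sketched in a few lines. This is an illustration only: it uses a plain translation and rotation into a local frame rather than a true Frenet conversion, and the helper `to_frame`, the toy trajectories, and the choice of the world centre as the mean of the agents' latest positions are all assumptions, not details from the paper.

```python
import numpy as np

def to_frame(points, origin, heading=0.0):
    """Express absolute (x, y) points in a frame located at `origin` and rotated by `heading` radians."""
    c, s = np.cos(heading), np.sin(heading)
    # Rows are the local x- and y-axes written in absolute coordinates.
    rot = np.array([[c, s], [-s, c]])
    return (np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)) @ rot.T

# Toy history: N = 3 agents, T = 2 timesteps, (x, y) in absolute coordinates.
agents = np.array([
    [[0.0, 0.0], [1.0, 0.0]],
    [[5.0, 5.0], [5.0, 6.0]],
    [[-2.0, 3.0], [-1.0, 3.0]],
])

# World-centred: one shared origin, so one feature tensor covers every agent.
world_centre = agents[:, -1].mean(axis=0)  # hypothetical choice of world centre
world_view = to_frame(agents, world_centre)  # shape (N, T, 2), one inference pass

# Agent-centred: the whole scene is re-expressed once per agent.
agent_views = np.stack(
    [to_frame(agents, agents[i, -1]) for i in range(len(agents))]
)  # shape (N, N, T, 2), N inference passes
```

The shapes show the cost difference: the world-centred tensor grows linearly with the number of agents, while the agent-centred representation repeats the scene once per agent.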