TSDiT: Traffic Scene Diffusion Models With Transformers
Chen Yang, Tianyu Shi
TL;DR
TSDiT tackles realistic traffic-scene trajectory generation by combining diffusion models with Transformer-based encoders in a world-centric framework. The model first learns an action latent via a DDPM built from DiT (Diffusion with Transformers) blocks, then encodes agent, history, and HD-map features with dedicated Transformer modules (Other Agent Former, HD Map Former, Embedding Block), and finally decodes future trajectories with a trajectory decoder. Key contributions are the world-centric input representation, three specialized Transformer modules for inter-agent and map interactions, and a diffusion-based action latent that yields diverse trajectories. The model is evaluated on Waymo data, where it achieves competitive ADE/FDE and strong qualitative behavior, especially for smooth turning. The approach offers a scalable, realistic pathway to traffic-scene generation for autonomous driving simulators and navigation systems.
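For readers unfamiliar with the ADE/FDE metrics mentioned above, the following is a minimal sketch of how they are commonly computed: ADE averages the Euclidean displacement over all predicted timesteps, and FDE takes the displacement at the final timestep only. The function name `ade_fde`, the array layout `(num_agents, num_steps, 2)`, and the toy values are assumptions for illustration; the exact evaluation protocol used for TSDiT (for example, any best-of-K sampling) is not specified in this section.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error.

    pred, gt: arrays of shape (num_agents, num_steps, 2) holding x/y positions.
    This layout is an assumption made for the sketch.
    """
    # Per-agent, per-timestep Euclidean distance, shape (num_agents, num_steps).
    dist = np.linalg.norm(pred - gt, axis=-1)
    ade = dist.mean()          # mean over all agents and timesteps
    fde = dist[:, -1].mean()   # mean over agents at the last timestep
    return float(ade), float(fde)

# Toy example: one agent, three future timesteps, ground truth at the origin.
gt = np.zeros((1, 3, 2))
pred = np.array([[[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]])
ade, fde = ade_fde(pred, gt)
print(ade, fde)  # displacements are 5, 0, 1, so ADE = 2.0 and FDE = 1.0
```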
Abstract
In this paper, we introduce a novel approach to trajectory generation for autonomous driving that combines the strengths of diffusion models and Transformers. First, we efficiently preprocess the historical trajectory data and generate an action latent using a diffusion model built from DiT (Diffusion with Transformers) blocks, which increases scene diversity and the stochasticity of agent actions. Then, we combine the action latent, historical trajectories, and HD-map features and feed them into separate Transformer blocks. Finally, we use a trajectory decoder to generate the future trajectories of agents in the traffic scene. The method performs particularly well at generating smooth turning trajectories, which improves the model's ability to fit complex steering patterns. The experimental results demonstrate the effectiveness of our method in producing realistic and diverse trajectories, showcasing its potential for application in autonomous vehicle navigation systems.
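As background for the diffusion step described above, the sketch below shows the standard DDPM forward (noising) process, which samples x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps in closed form. It is not the TSDiT implementation: the linear noise schedule, the 1000-step horizon, the latent shape (4 agents, 16 dimensions), and the helper names `linear_beta_schedule` and `q_sample` are all assumptions chosen for illustration. In a full model, a denoising network (here, one built from DiT blocks) would be trained to predict `eps` from `xt` and `t`.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice, common in DDPM setups)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
x0 = np.ones((4, 16))                 # hypothetical action latent: 4 agents, 16 dims
xt, eps = q_sample(x0, 999, alpha_bar, rng)
print(xt.shape, alpha_bar[-1])        # at the last step x_t is close to pure noise
```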
