Controllable Diverse Sampling for Diffusion Based Motion Behavior Forecasting
Yiming Xu, Hao Cheng, Monika Sester
TL;DR
This work tackles multimodal trajectory forecasting for autonomous driving under data imbalance and the risk of mode collapse. It introduces Controllable Diffusion Trajectory (CDT), a Transformer-based conditional diffusion model that uses map information and behavior tokens to steer sampling across multiple plausible modes, with optional endpoint conditioning for accuracy. The results indicate that behavior tokens improve diversity while map conditioning preserves scene compliance, and that endpoint conditioning can yield high-accuracy predictions at the expense of diversity. On Argoverse 2, CDT variants achieve competitive or superior performance in diversity, scene compliance, and short-horizon accuracy, offering a practical path toward controllable, diverse motion prediction in complex urban environments.
Abstract
In autonomous driving, trajectory prediction in complex traffic environments must respect real-world scene context and capture the multimodality of agent behavior. Existing methods predominantly rely on prior assumptions or on generative models trained on curated data to learn road agents' stochastic behavior within scene constraints. However, they often suffer from mode averaging due to data imbalance and simplistic priors, and can even suffer from mode collapse due to unstable training and supervision from a single ground truth. As a result, existing methods lose both predictive diversity and adherence to scene constraints. To address these challenges, we introduce a novel trajectory generator named Controllable Diffusion Trajectory (CDT), which integrates map information and social interactions into a Transformer-based conditional denoising diffusion model to guide the prediction of future trajectories. To ensure multimodality, we incorporate behavioral tokens that direct the trajectory's mode, such as going straight, turning right, or turning left. Moreover, we incorporate predicted endpoints as an alternative behavioral token in the CDT model to facilitate accurate trajectory prediction. Extensive experiments on the Argoverse 2 benchmark demonstrate that CDT excels at generating diverse and scene-compliant trajectories in complex urban settings.
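
To make the idea of steering a diffusion sampler with a behavior token concrete, the following is a minimal sketch and not the CDT implementation. It runs standard DDPM ancestral sampling over a 2D trajectory, with the noise predictor conditioned on a token. Everything named here is an assumption for illustration: the token vocabulary, the hand-made template trajectories, the noise schedule, and in particular `toy_eps_model`, which stands in for a trained Transformer denoiser and ignores map and social context entirely.

```python
import numpy as np

# Hypothetical behavior vocabulary; the abstract names these only as example modes.
BEHAVIORS = ("straight", "left", "right")


def template_trajectory(token: str, horizon: int = 60) -> np.ndarray:
    """Toy 'clean' trajectory of shape (horizon, 2) in an agent-centric frame
    (x forward, y to the left). Purely illustrative, not learned from data."""
    s = np.linspace(0.0, 1.0, horizon)
    lateral = {"straight": 0.0, "left": 1.0, "right": -1.0}[token]
    return np.stack([30.0 * s, 10.0 * lateral * s**2], axis=-1)


def toy_eps_model(x_t: np.ndarray, t: int, token: str, alpha_bar: np.ndarray) -> np.ndarray:
    """Stand-in for a trained conditional denoiser eps_theta(x_t, t, token).
    It returns the noise that would turn the token's template into x_t,
    i.e. an oracle that believes the clean trajectory is the template."""
    x0 = template_trajectory(token, horizon=x_t.shape[0])
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])


def sample_trajectory(token: str, horizon: int = 60, num_steps: int = 50, seed: int = 0) -> np.ndarray:
    """DDPM ancestral sampling conditioned on a behavior token."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, num_steps)  # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    x = rng.standard_normal((horizon, 2))  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(x, t, token, alpha_bar)
        # Posterior mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    for token in BEHAVIORS:
        traj = sample_trajectory(token)
        print(token, "final lateral offset:", float(traj[-1, 1]))
```

Because the stand-in denoiser is an oracle for a fixed template, each token deterministically pulls the sample onto its own template, so changing the token changes the mode of the output. A trained model would instead produce a distribution of trajectories per token, shaped by the map and the surrounding agents.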
