Controllable Diverse Sampling for Diffusion Based Motion Behavior Forecasting

Yiming Xu, Hao Cheng, Monika Sester

TL;DR

This work tackles multimodal trajectory forecasting for autonomous driving under data imbalance and potential mode collapse. It introduces Controllable Diffusion Trajectory (CDT), a Transformer-based conditional diffusion model that uses map information and behavior tokens to steer sampling across multiple plausible modalities, with optional endpoint conditioning for accuracy. The results show that behavior tokens improve diversity while map conditioning preserves scene compliance, and that endpoint conditioning can yield high-accuracy predictions at the expense of diversity. On Argoverse 2, CDT variants achieve competitive or superior performance in diversity, scene compliance, and short-horizon accuracy, offering a practical path toward controllable, diverse motion prediction in complex urban environments.

Abstract

In autonomous driving tasks, trajectory prediction in complex traffic environments requires adherence to real-world context conditions and behavior multimodalities. Existing methods predominantly rely on prior assumptions or generative models trained on curated data to learn road agents' stochastic behavior bounded by scene constraints. However, they often face mode-averaging issues due to data imbalance and simplistic priors, and can even suffer from mode collapse due to unstable training and single ground-truth supervision. These issues cause existing methods to lose predictive diversity and adherence to scene constraints. To address these challenges, we introduce a novel trajectory generator named Controllable Diffusion Trajectory (CDT), which integrates map information and social interactions into a Transformer-based conditional denoising diffusion model to guide the prediction of future trajectories. To ensure multimodality, we incorporate behavioral tokens to direct the trajectory's modes, such as going straight, turning right, or turning left. Moreover, we incorporate the predicted endpoints as an alternative behavioral token into the CDT model to facilitate the prediction of accurate trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that CDT excels in generating diverse and scene-compliant trajectories in complex urban settings.
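
To make the idea of a behavioral token concrete, the following is a minimal sketch of how a coarse label (straight, left, right) could be derived from a trajectory. It is not taken from the paper: the function name `behavior_label`, the 30-degree threshold, and the use of the heading change between the first and last segments are all assumptions chosen for illustration.

```python
import numpy as np

def behavior_label(traj, turn_threshold_deg=30.0):
    """Assign a coarse behavior label to a 2D trajectory of shape (N, 2).

    Hypothetical heuristic (not the paper's method): compare the heading
    of the first segment with the heading of the last segment.
    """
    start_dir = traj[1] - traj[0]
    end_dir = traj[-1] - traj[-2]
    # Signed angle from start heading to end heading;
    # counter-clockwise (a left turn) is positive.
    cross = start_dir[0] * end_dir[1] - start_dir[1] * end_dir[0]
    dot = float(start_dir @ end_dir)
    angle_deg = np.degrees(np.arctan2(cross, dot))
    if angle_deg > turn_threshold_deg:
        return "left"
    if angle_deg < -turn_threshold_deg:
        return "right"
    return "straight"

# Toy trajectories: a straight line and two quarter-circle arcs.
straight = np.stack([np.linspace(0.0, 10.0, 20), np.zeros(20)], axis=-1)
theta = np.linspace(0.0, np.pi / 2, 20)
left = np.stack([np.sin(theta), 1.0 - np.cos(theta)], axis=-1)
right = left * np.array([1.0, -1.0])  # mirror the arc across the x-axis

for name, traj in [("straight", straight), ("left", left), ("right", right)]:
    print(name, "->", behavior_label(traj))
```

A label of this kind could then be embedded and passed to the denoiser as a condition; how CDT actually defines and encodes its tokens is specified in the paper, not here.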

Paper Structure (17 sections, 6 equations, 5 figures, 2 tables)

This paper contains 17 sections, 6 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: CDT leverages traffic condition information and behavioral tokens as denoising conditions, sampling condition-based trajectories from Gaussian noise.
  • Figure 2: Our trajectory prediction framework uses a controllable diffusion model consisting of an Encoder, a Confidence Decoder, a Classifier, and a Transformer-based Denoiser. The Encoder encodes historical trajectories and maps and integrates them with time steps and behavioral conditions as inputs to the Denoiser. In the diffusion phase, ground-truth trajectories are incrementally degraded by Gaussian noise over T iterations. In the inference phase, the Denoiser iteratively denoises the noisy data over T steps. The model also outputs future behavior classifications via the Classifier and trajectory confidence levels via the Confidence Decoder.
  • Figure 3: Qualitative comparison of the models in complex scenarios from the validation set. Each column shows a unique intersection, and each row shows the predictions of one model. The models are: CDT, the baseline model without any behavior tokens; CDT w/o map token, the model with behavior tokens but without the map token; CDT-B, the model with both behavior and map tokens; CDT-P, the model with endpoint and map tokens; and QCNet, the winning model on the Argoverse 2 leaderboard.
  • Figure 4: Normalized confusion matrix of the behavioral classification results on the validation set.
  • Figure 5: The network was trained with different numbers of denoising steps.
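
The diffusion phase mentioned in the Figure 2 caption, where a ground-truth trajectory is progressively degraded by Gaussian noise over T iterations, can be sketched with the standard DDPM closed-form forward process. This is only an illustration under stated assumptions: the linear noise schedule, T = 1000, and the toy 60-waypoint trajectory are placeholders, not values confirmed by this page.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear variance schedule; the paper's schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
# Toy ground-truth future: 60 (x, y) waypoints along a straight line.
x0 = np.stack([np.linspace(0.0, 30.0, 60), np.zeros(60)], axis=-1)

x_early, _ = forward_diffuse(x0, 0, alpha_bar, rng)      # almost clean
x_late, _ = forward_diffuse(x0, T - 1, alpha_bar, rng)   # almost pure noise

print("mean |x_t - x0| at t=0:  ", np.abs(x_early - x0).mean())
print("mean |x_t - x0| at t=T-1:", np.abs(x_late - x0).mean())
```

During training, a denoiser would typically be asked to recover `noise` (or `x0`) from `x_t`, the step index, and the conditioning inputs; at inference the process is run in reverse starting from Gaussian noise.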