MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion

Chiyu Max Jiang, Andre Cornman, Cheolho Park, Ben Sapp, Yin Zhou, Dragomir Anguelov

TL;DR

MotionDiffuser is a diffusion-based framework for joint multi-agent trajectory prediction that is permutation-invariant and can model highly multimodal futures. It combines a transformer-based set denoiser with diffusion in a PCA-compressed latent space, which keeps the trajectory representation compact and allows exact log-probability inference. A constrained sampling scheme based on differentiable costs (attractor and repeller) allows controllable trajectory synthesis, making it suitable for enforcing rules, priors, and custom scenarios. The method reports state-of-the-art results on the Waymo Open Motion Dataset Interactive split, supported by ablation studies and demonstrations of controllable generation.

Abstract

We present MotionDiffuser, a diffusion-based representation for the joint distribution of future trajectories over multiple agents. Such a representation has several key advantages. First, our model learns a highly multimodal distribution that captures diverse future outcomes. Second, the simple predictor design requires only a single L2 loss training objective and does not depend on trajectory anchors. Third, our model is capable of learning the joint distribution for the motion of multiple agents in a permutation-invariant manner. Furthermore, we utilize a compressed trajectory representation via PCA, which improves model performance and allows for efficient computation of the exact sample log probability. Subsequently, we propose a general constrained sampling framework that enables controlled trajectory sampling based on differentiable cost functions. This strategy enables a host of applications such as enforcing rules and physical priors, or creating tailored simulation scenarios. MotionDiffuser can be combined with existing backbone architectures to achieve top motion forecasting results. We obtain state-of-the-art results for multi-agent motion prediction on the Waymo Open Motion Dataset.
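
The PCA compression mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the synthetic quadratic trajectories, the 80-step horizon, and the helper names `fit_pca`, `compress`, and `reconstruct`. It shows only the general idea of projecting flattened trajectories onto a few principal components and measuring the reconstruction error.

```python
import numpy as np

def fit_pca(trajs, k):
    """Fit PCA on flattened trajectories; trajs has shape (n, T, 2)."""
    flat = trajs.reshape(len(trajs), -1)
    mean = flat.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:k]

def compress(trajs, mean, basis):
    """Project trajectories onto the PCA basis; returns shape (n, k)."""
    flat = trajs.reshape(len(trajs), -1)
    return (flat - mean) @ basis.T

def reconstruct(latents, mean, basis, horizon):
    """Map PCA coefficients back to (n, horizon, 2) trajectories."""
    flat = latents @ basis + mean
    return flat.reshape(len(latents), horizon, 2)

# Synthetic stand-in data (assumed): smooth quadratic paths in x and y.
rng = np.random.default_rng(0)
horizon = 80
t = np.linspace(0.0, 1.0, horizon)[None, :, None]
coeffs = rng.normal(size=(500, 3, 2))
trajs = (coeffs[:, 0, None, :]
         + coeffs[:, 1, None, :] * t
         + coeffs[:, 2, None, :] * t ** 2)

# Reconstruction RMSE for a few latent sizes.
rmse = {}
for k in (2, 4, 6):
    mean, basis = fit_pca(trajs, k)
    latents = compress(trajs, mean, basis)
    recon = reconstruct(latents, mean, basis, horizon)
    rmse[k] = float(np.sqrt(np.mean((recon - trajs) ** 2)))

print(rmse)
```

Because these toy trajectories are generated from six free coefficients, six components reconstruct them almost exactly; real driving trajectories would need a component count chosen from a reconstruction-error analysis.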

Paper Structure (29 sections, 18 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: MotionDiffuser is a learned representation for the distribution of multi-agent trajectories based on diffusion models. During inference, samples from the predicted joint future distribution are first drawn i.i.d. from a normal distribution (leftmost column) and then gradually denoised by a learned denoiser into the final predictions (rightmost column). Diffusion allows us to learn a diverse, multimodal distribution over joint outputs (top right). Furthermore, guidance in the form of a differentiable cost function can be applied at inference time to obtain results satisfying additional priors and constraints (bottom right).
  • Figure 2: Overview of multi-agent motion prediction using diffusion models. The input scene, containing agent history, traffic lights and road graphs, is encoded via a transformer encoder into a set of condition tokens $\bm{C}$. During training, random noise is sampled i.i.d. from a normal distribution and added to the ground truth (GT) trajectory. The denoiser, while attending to the condition tokens, predicts the denoised trajectories corresponding to each agent. The entire model can be trained end-to-end using a simple L2 loss between the predicted denoised trajectory and the GT trajectory. During inference, a population of trajectories for each agent is first sampled from pure noise at the highest noise level $\sigma_{\text{max}}$ and then iteratively denoised by the denoiser to produce a plausible distribution of future trajectories. An optional constraint in the form of an arbitrary differentiable loss function can be injected into the denoising process.
  • Figure 3: Network architecture for the set denoiser $D_{\bm{\theta}}(\bm{S};\bm{C},\sigma)$. The noisy trajectories corresponding to agents $\bm{s}_1\cdots\bm{s}_{N_a}$ are first concatenated with a random-Fourier-encoded noise level $\sigma$, before going through repeated blocks of self-attention among the set of trajectories and cross-attention with respect to the condition tokens $\bm{c}_1\cdots\bm{c}_{N_c}$. The self-attention allows the diffusion model to learn a joint distribution across the agents, and the cross-attention allows the model to learn a more accurate scene-conditional distribution. Note that each agent cross-attends to its own condition tokens from the agent-centric scene encoding (not shown for simplicity). The [learnable components] are marked with brackets.
  • Figure 4: Inferred exact log probability of 64 sampled trajectories per agent. Higher probability samples are plotted with lighter colors. The orange agent represents the AV (autonomous vehicle).
  • Figure 5: Analysis of PCA representation for agent trajectories. (a) shows the average reconstruction error for varying numbers of principal components. (b) shows a visualization of the top-$10$ principal components. The higher modes representing higher frequencies are increasingly similar and have a small impact on the final trajectory.
  • ...and 1 more figures
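
The training and inference loop described in the Figure 2 caption can be sketched on a toy problem. This is a hedged illustration, not the paper's method: the analytic Gaussian `denoiser` stands in for the learned transformer denoiser, the Euler-style update from $\sigma_{\text{max}}$ down to zero is an assumed sampler, and the guidance rule (nudging the denoised estimate down the gradient of an attractor cost with a hypothetical `weight`) is one simple way to inject a differentiable cost. The names `MU`, `S`, `TARGET`, `l2_denoising_loss`, and `sample` are all made up for this example.

```python
import numpy as np

MU = np.array([5.0, 0.0])      # toy data mean (assumed)
S = 1.0                        # toy data standard deviation (assumed)
TARGET = np.array([5.0, 4.0])  # attractor location (assumed)

def denoiser(x, sigma):
    """Optimal denoiser for data ~ N(MU, S^2 I); stand-in for a learned network."""
    shrink = S ** 2 / (S ** 2 + sigma ** 2)
    return MU + shrink * (x - MU)

def l2_denoising_loss(gt, sigma, rng):
    """Add noise at level sigma to the GT, denoise, and score with an L2 loss."""
    noisy = gt + sigma * rng.normal(size=gt.shape)
    return float(np.mean((denoiser(noisy, sigma) - gt) ** 2))

def attractor_grad(x):
    """Gradient of the cost 0.5 * ||x - TARGET||^2."""
    return x - TARGET

def sample(n, rng, sigma_max=20.0, sigma_min=0.01, steps=50,
           cost_grad=None, weight=0.3):
    """Start from pure noise at sigma_max and denoise iteratively to sigma = 0."""
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = sigma_max * rng.normal(size=(n, 2))
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, s_cur)
        if cost_grad is not None:
            # Guidance (assumed rule): move the denoised estimate down the cost.
            denoised = denoised - weight * cost_grad(denoised)
        # Euler step toward the (possibly guided) denoised estimate.
        x = x + (s_next - s_cur) * (x - denoised) / s_cur
    return x

rng = np.random.default_rng(0)
gt = MU + S * rng.normal(size=(2000, 2))
loss = l2_denoising_loss(gt, sigma=2.0, rng=rng)
unguided = sample(2000, rng)
guided = sample(2000, rng, cost_grad=attractor_grad)

print("L2 loss at sigma=2:", loss)
print("unguided mean:", unguided.mean(axis=0))
print("guided mean:", guided.mean(axis=0))
```

In this toy setting the unguided samples land near `MU`, while the guided samples are pulled toward `TARGET`. With this particular rule the guided samples also contract around the attractor, so the `weight` trades off fidelity to the data distribution against constraint satisfaction.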