MDG: Masked Denoising Generation for Multi-Agent Behavior Modeling in Traffic Environments

Zhiyu Huang, Zewei Zhou, Tianhui Cai, Yun Zhang, Jiaqi Ma

TL;DR

MDG reframes multi-agent trajectory generation as masked denoising of structured spatiotemporal tensors, replacing diffusion time steps and token-based generation with per-element noise masks and a Transformer denoiser. It supports one-step or few-step reconstruction, and its flexible inference modes cover open-loop prediction, closed-loop simulation, and planning within a single model. Trained on large-scale real-world driving datasets, MDG achieves competitive closed-loop performance on the Waymo Sim Agents and nuPlan benchmarks while offering efficient, controllable open-loop generation. By handling these traffic-behavior tasks with one model instead of task-specific designs, the framework aims to be simple, scalable, and reusable across data-driven autonomous systems.

Abstract

Modeling realistic and interactive multi-agent behavior is critical to autonomous driving and traffic simulation. However, existing diffusion and autoregressive approaches are limited by iterative sampling, sequential decoding, or task-specific designs, which hinder efficiency and reuse. We propose Masked Denoising Generation (MDG), a unified generative framework that reformulates multi-agent behavior modeling as the reconstruction of independently noised spatiotemporal tensors. Instead of relying on diffusion time steps or discrete tokenization, MDG applies continuous, per-agent and per-timestep noise masks that enable localized denoising and controllable trajectory generation in a single or few forward passes. This mask-driven formulation generalizes across open-loop prediction, closed-loop simulation, motion planning, and conditional generation within one model. Trained on large-scale real-world driving datasets, MDG achieves competitive closed-loop performance on the Waymo Sim Agents and nuPlan Planning benchmarks, while providing efficient, consistent, and controllable open-loop multi-agent trajectory generation. These results position MDG as a simple yet versatile paradigm for multi-agent behavior modeling.
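
To make the idea of continuous, per-agent and per-timestep noise masks concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the interpolation rule `(1 - m) * x + m * eps`, the function names `apply_noise_mask` and `agent_wise_mask`, and the nested-list tensor layout `[agent][timestep][feature]` are all hypothetical, since this summary does not give the paper's exact noising rule.

```python
import random


def agent_wise_mask(num_agents, num_steps, clean_agents):
    """Build a [num_agents][num_steps] mask of noise levels in [0, 1].

    Agents listed in `clean_agents` get level 0.0 (kept as conditioning);
    every other agent gets level 1.0 (fully noised, to be generated).
    This pattern is a hypothetical example of agent-wise masking.
    """
    return [
        [0.0 if agent in clean_agents else 1.0 for _ in range(num_steps)]
        for agent in range(num_agents)
    ]


def apply_noise_mask(traj, mask, rng):
    """Noise each (agent, timestep) element independently.

    traj: nested list indexed as [agent][timestep][feature]
    mask: nested list indexed as [agent][timestep], values in [0, 1]
    Assumed rule (not taken from the paper): (1 - m) * x + m * eps,
    with eps drawn from a standard normal distribution.
    """
    noised = []
    for agent_traj, agent_mask in zip(traj, mask):
        noised_agent = []
        for state, level in zip(agent_traj, agent_mask):
            noised_agent.append(
                [(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in state]
            )
        noised.append(noised_agent)
    return noised


if __name__ == "__main__":
    rng = random.Random(0)
    # Two agents, three timesteps, (x, y) features.
    traj = [
        [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
        [[5.0, 5.0], [5.0, 4.0], [5.0, 3.0]],
    ]
    mask = agent_wise_mask(num_agents=2, num_steps=3, clean_agents={0})
    for agent_states in apply_noise_mask(traj, mask, rng):
        print(agent_states)
```

In this sketch, an element with mask value 0.0 passes through unchanged and an element with mask value 1.0 is replaced by pure noise, so the mask alone determines which agents and timesteps a denoiser would be asked to reconstruct.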

Paper Structure

This paper contains 32 sections, 13 equations, 11 figures, 14 tables, 3 algorithms.

Figures (11)

  • Figure 1: Comparison of MDG with existing trajectory generation paradigms. MDG denoises masked spatiotemporal tensors under varied noise-masking patterns, enabling temporal-wise and agent-wise generation, guided conditioning, and closed-loop reuse. Unlike autoregressive models, MDG predicts full-sequence multi-agent futures in a single step; unlike joint trajectory diffusion, it supports fine-grained control with efficient, flexible sampling.
  • Figure 2: Overview of the MDG model structure. The Scene Encoder integrates scene context, including agent states, map polylines, and traffic lights, using modality-specific networks and a query-centric Transformer to produce a unified scene representation. An auxiliary MLP head decodes the agent representations into predicted trajectories as a regularizer. The ego route polylines are encoded via an MLP-Mixer network. The Denoiser processes the mask-conditioned, noised spatiotemporal trajectory tensor through stacked Transformer blocks with intra-agent temporal self-attention, inter-agent interaction cross-attention, and agent-scene condition cross-attention; only the ego agent attends to its route context, which is used for planning tasks. A final MLP head outputs clean, denoised trajectories.
  • Figure 3: Qualitative results of MDG in closed-loop multi-agent simulation and ego-agent planning tasks. MDG controls all agents in an interactive, map-compliant manner and navigates the ego vehicle effectively in complex scenarios.
  • Figure 4: Qualitative results of MDG in multi-agent open-loop prediction. The one-step denoising mode produces plausible, interactive scenarios but with limited sample diversity. With multi-step denoising along the temporal axis, the generated scenarios are more diverse and agents display clearly multimodal behaviors.
  • Figure 5: Illustration of the controllable generation task. Goals are assigned to target agents, and MDG guides them to reach the designated goals while maintaining reactions for surrounding agents.
  • ...and 6 more figures
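
The contrast between one-step denoising and multi-step denoising along the temporal axis (Figures 1 and 4) can be pictured as a schedule of masks. The sketch below is only one possible reading, offered as an assumption: the function name `temporal_mask_schedule`, the equal-size chunking, and the binary 0.0/1.0 levels are hypothetical and are not taken from the paper.

```python
def temporal_mask_schedule(num_agents, num_steps, num_passes):
    """Yield one [num_agents][num_steps] noise mask per denoising pass.

    Assumed scheme (hypothetical): the horizon is split into `num_passes`
    chunks. Pass k treats the first k chunks as already denoised
    (level 0.0) and all later timesteps as fully noised (level 1.0).
    With num_passes == 1 this reduces to a single fully noised mask,
    i.e. one-step generation of the whole horizon.
    """
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    chunk = -(-num_steps // num_passes)  # ceiling division
    for k in range(num_passes):
        clean_until = k * chunk
        row = [0.0 if t < clean_until else 1.0 for t in range(num_steps)]
        # Copy the row per agent so callers can edit agents independently.
        yield [list(row) for _ in range(num_agents)]


if __name__ == "__main__":
    # Two agents, an 8-step horizon, denoised in 4 temporal passes.
    for k, mask in enumerate(temporal_mask_schedule(2, 8, 4)):
        print(k, mask[0])
```

Under this reading, each pass would hand the denoiser a tensor whose earlier timesteps are fixed from previous passes, so later segments are sampled conditioned on earlier ones; that conditioning is one plausible source of the added diversity the caption describes.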