Diffusion Forcing for Multi-Agent Interaction Sequence Modeling

Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa

TL;DR

MAGNet introduces a unified multi-agent diffusion framework that models inter-agent coupling via relative transforms and a latent pose space, enabling dyadic and polyadic motion generation, partner inpainting, and joint future prediction. By integrating VQ-VAE pose tokens with a Diffusion Forcing Transformer and diverse sampling strategies, MAGNet achieves coherent long-horizon interactions across varying agent counts while maintaining real-time inference. The approach shows competitive performance across datasets and tasks, with improvements in interaction realism and diversity, and demonstrates flexible control through history guidance and agentic sampling. Limitations include occasional inter-agent penetrations due to lack of explicit physical constraints, suggesting future work incorporating physics priors for stronger physical plausibility.

Abstract

Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of steps. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/

Paper Structure

This paper contains 23 sections, 28 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: A Generative Model for Multi-Agent Interaction. We propose Multi-Agent Diffusion Forcing Transformer (MAGNet), a unified approach for modeling and generating realistic motion of multiple interacting humans. MAGNet handles diverse interactions, from synchronized activities like dancing (top-left) to arbitrary social situations (top-right) with more than two people, generating sequences that can be rolled out for hundreds of steps, with diverse samples (bottom). A single trained model supports multiple tasks at test time: Partner Inpainting (generating one agent's motion given the complete motion of the others; top-left), Joint Future Prediction (predicting all agents' futures from past motions; all other panels), and more. The model also supports agentic (turn-taking) sampling. Pink indicates known conditioning poses.
  • Figure 2: Coordinate Transform Representations. We use relative coordinate frames for both intra- and inter-person transforms, freeing the model from absolute frame definitions.
  • Figure 3: Multi-Agent Diffusion Forcing Transformer (MAGNet). Left (Training): Each agent’s motion is encoded by a VQ-VAE into latent pose tokens, forming motion tokens $m_i^p$ by appending latent vectors with transform parameters. Tokens from all agents are interleaved and processed by a Diffusion Forcing Transformer with independently noised tokens. Right (Inference): The model enables flexible conditioning: known (blank) tokens are fixed, while unknown tokens are causally denoised. This supports partner inpainting, joint prediction, and agentic turn-taking, where agents alternately generate motion and highlighted streams can run independently (e.g., on separate robots).
  • Figure 4: Samples from our model. We show samples from our model for different types of interaction and number of people. Our model generates realistic interactions including combat sports like boxing. In the bottom right, we show in-betweening results. Pink indicates known conditioning poses. Please also see the supplemental video.
  • Figure A.1: Example of an inter-agent penetration artifact generated by MAGNet. Trained without explicit physical constraints, the model fails to enforce non-collision, causing Agent A’s hand to pass through Agent B’s torso during contact. This reflects a common limitation of data-driven motion models trained solely on motion capture data.
  • ...and 1 more figure
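
The Figure 3 caption describes tokens from all agents being interleaved, noised independently, and held fixed when they are known conditioning. The NumPy sketch below illustrates only that bookkeeping and is not the authors' implementation. The function names, the time-major interleaving order, and the linear noise schedule are all assumptions made for illustration; the caption states only that tokens are interleaved and independently noised.

```python
import numpy as np


def interleave_agents(tokens):
    """Flatten per-agent tokens of shape (A, T, D) into one sequence (T*A, D).

    Assumed ordering (hypothetical): time-major, so the sequence reads
    [agent 0 @ t0, agent 1 @ t0, ..., agent 0 @ t1, agent 1 @ t1, ...].
    """
    num_agents, num_steps, dim = tokens.shape
    return tokens.transpose(1, 0, 2).reshape(num_steps * num_agents, dim)


def noise_tokens(seq, known_mask, num_levels, rng):
    """Draw an independent noise level per token; known tokens stay clean.

    The linear schedule below is a placeholder assumption, not the
    schedule used in the paper.
    """
    levels = rng.integers(0, num_levels, size=seq.shape[0])
    levels = np.where(known_mask, 0, levels)      # level 0 means no noise
    alpha = 1.0 - levels / num_levels             # alpha = 1 at level 0
    eps = rng.standard_normal(seq.shape)
    noisy = np.sqrt(alpha)[:, None] * seq + np.sqrt(1.0 - alpha)[:, None] * eps
    return noisy, levels


# Usage: a partner-inpainting style mask for two agents over three steps,
# where agent 0 is fully known and agent 1 is to be generated.
rng = np.random.default_rng(0)
tokens = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)   # (A, T, D)
seq = interleave_agents(tokens)

known = np.zeros((2, 3), dtype=bool)
known[0] = True                                   # agent 0 is conditioning
known_mask = known.T.reshape(-1)                  # same ordering as seq

noisy, levels = noise_tokens(seq, known_mask, num_levels=10, rng=rng)
```

Under these assumptions, changing `known_mask` is all that distinguishes the tasks: marking every past step of every agent as known corresponds to joint future prediction, while marking one agent's full stream as known corresponds to partner inpainting.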