JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo

TL;DR

JointDiff tackles the gap between continuous trajectories and synchronous discrete events by introducing a joint diffusion framework for dynamic multi-agent scenes. It unifies the forward diffusion of continuous coordinates and discrete events, and learns a single reverse model with two heads, enabling controllable generation through weak-possessor-guidance and natural language text guidance via CrossGuid. The approach achieves state-of-the-art results on completion and controllable generation across basketball, football, and soccer datasets, and demonstrates superior scene-level coherence and consistency over absorbing-state baselines. This work advances interactive, controllable high-dimensional generation in sports analytics and multi-agent simulation, with potential extensions to sparse events and broader data modalities.

Abstract

Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.
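The joint forward process summarized above can be illustrated with a small sketch. The code below is not the authors' implementation; it only shows one plausible way to noise continuous coordinates and discrete possession labels with a shared noise level. The array shapes, the function names (`noise_trajectories`, `noise_events`, `joint_forward`), the shared `alpha_bar` value, and the uniform transition kernel for events are all assumptions made for illustration.

```python
import numpy as np


def noise_trajectories(x0, alpha_bar, rng):
    """Gaussian forward step: x_s = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_s = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_s, eps


def noise_events(e0, alpha_bar, num_classes, rng):
    """Categorical forward step with a uniform kernel (an assumption):
    keep each label with probability alpha_bar, else resample uniformly."""
    keep = rng.random(e0.shape) < alpha_bar
    random_labels = rng.integers(0, num_classes, size=e0.shape)
    return np.where(keep, e0, random_labels)


def joint_forward(x0, e0, alpha_bar, num_classes, rng):
    """Noise both modalities at the same (hypothetical) noise level."""
    x_s, eps = noise_trajectories(x0, alpha_bar, rng)
    e_s = noise_events(e0, alpha_bar, num_classes, rng)
    return x_s, eps, e_s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical scene: 50 frames, 11 agents (10 players + ball), 2D positions.
    x0 = rng.uniform(0.0, 1.0, size=(50, 11, 2))
    # Hypothetical per-frame possessor index among 10 players.
    e0 = rng.integers(0, 10, size=50)
    x_s, eps, e_s = joint_forward(x0, e0, alpha_bar=0.5, num_classes=10, rng=rng)
    print(x_s.shape, eps.shape, e_s.shape)
```

A reverse model in this setting would take `x_s` and `e_s` as input and predict `eps` and a per-frame distribution over possessors, matching the two-headed output described in the TL;DR.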

Paper Structure

This paper contains 46 sections, 22 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: JointDiff. Our model jointly generates continuous trajectories and discrete events, with guidance provided through either weak-possessor information or natural language text. Stars ($\star$) refer to the initial timestep.
  • Figure 2: Model Architecture. Left: The overall pipeline of our JointDiff model, which takes as input the noisy states $\bm{\mathrm{X}}_s$, observed states $\bm{\mathrm{X}}^{\mathrm{co}}$, mask $\mathbf{M}$, and optionally (indicated by dashed connections) the encoded guidance signal $G$. Stars ($\star$) refer to the initial timestep, $t=0$. The model processes these inputs through two Social-Temporal Blocks and outputs the predicted Gaussian noise ${\mathbf{\epsilon}}_\theta$ for trajectories and the event probability distribution $\pi_\theta$. Right: Detailed view of a Social-Temporal Block featuring our proposed CrossGuid module. The module has two distinct implementations corresponding to different guidance modalities (WPG and Text). The red line ($-$) in the Social-Temporal Block indicates the data flow for non-controllable generation, where CrossGuid is bypassed. An extended diagram is available in Fig. 5.
  • Figure 3: Human evaluation on NBA future generation. The histogram reports the proportions of wins, ties, and losses for JointDiff against each baseline, with $n$ denoting the number of pairwise comparisons.
  • Figure 4: Controllable Generation. Comparison of JointDiff vs. Ours w/o joint on the text-guidance task, given the same past observations with different text prompts ${\mathcal{G}}_\text{text}$. Legend: Ball, Home team, Away team, $\bigcirc$ Past observations. See animated scenes in supplementary.
  • Figure 5: Model Architecture (extended version of Fig. 2). The light gray box represents the JointDiff architecture, and STB refers to the Social-Temporal Block.
  • ...and 9 more figures