CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion

Jiarui Sun, Girish Chowdhary

TL;DR

CoMusion presents a single-stage, diffusion-based framework for stochastic human motion prediction that preserves spatial-temporal structure by coupling a Transformer-based reconstruction module with a GCN refinement module operating in the DCT space. Unlike noise-prediction approaches, it directly predicts the future motion $y_0$ conditioned on history $x^{1:H}$, aided by a tailored variance scheduler that ensures non-trivial denoising throughout the chain. The architecture achieves state-of-the-art results on Human3.6M and AMASS, with large gains in Cumulative Motion Distribution (CMD) and Fréchet Inception Distance (FID), and ablation studies validate the crucial roles of the reconstruction module, the GCN refinement, and the scheduler. The work demonstrates that integrating GCN-DCT design with diffusion modeling in a single stage can yield highly realistic, consistent, and diverse motion predictions, with practical efficiency and a released codebase.
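For context, the contrast between noise prediction and direct prediction of $y_0$ can be written in generic DDPM notation. This is a sketch under the usual Gaussian forward-process assumption. The symbols $\beta_t$, $\bar{\alpha}_t$, $\epsilon_\theta$, and the plain squared-error objectives come from the general diffusion literature and are assumptions here, not the exact loss or scheduler defined in the paper.

```latex
\begin{align}
q(y_t \mid y_0) &= \mathcal{N}\!\left(y_t;\ \sqrt{\bar{\alpha}_t}\, y_0,\ (1-\bar{\alpha}_t)\, I\right),
  \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s) \\
\text{noise prediction:}\quad
  \mathcal{L}_{\epsilon} &= \mathbb{E}_{y_0, t, \epsilon}
  \left\| \epsilon - \epsilon_\theta\!\left(y_t, x^{1:H}, t\right) \right\|^2 \\
\text{direct prediction:}\quad
  \mathcal{L}_{y_0} &= \mathbb{E}_{y_0, t, \epsilon}
  \left\| y_0 - G_\theta\!\left(y_t, x^{1:H}, t\right) \right\|^2
\end{align}
```

In the direct form the network output is itself a motion sequence, so structural operations on motion (such as a DCT-space refinement) can be applied to it at every diffusion level.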

Abstract

Stochastic Human Motion Prediction (HMP) aims to predict multiple possible future human pose sequences from observed ones. Most prior works learn motion distributions through encoding-decoding in the latent space, which does not preserve motion's spatial-temporal structure. While effective, these methods often require complex, multi-stage training and yield predictions that are inconsistent with the provided history and can be physically unrealistic. To address these issues, we propose CoMusion, a single-stage, end-to-end diffusion-based stochastic HMP framework. CoMusion is inspired by the insight that a smooth future pose initialization improves prediction performance, a strategy not previously utilized in stochastic models but evidenced in deterministic works. To generate such an initialization, CoMusion's motion predictor starts with a Transformer-based network for initial reconstruction of corrupted motion. Then, a graph convolutional network (GCN) is employed to refine the prediction considering past observations in the discrete cosine transform (DCT) space. Our method, facilitated by the Transformer-GCN module design and a proposed variance scheduler, excels in predicting accurate, realistic, and consistent motions, while maintaining appropriate diversity. Experimental results on benchmark datasets demonstrate that CoMusion surpasses prior methods across metrics and achieves superior generation quality. Our code is released at https://github.com/jsun57/CoMusion/ .
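To make the sampling loop implied by the abstract concrete, the following is a minimal NumPy sketch of reverse diffusion when the predictor outputs the clean future motion rather than noise. It uses the standard DDPM posterior. The linear schedule in `make_schedule`, the function names, and the stand-in `predict_y0` callable are all hypothetical; in particular, the linear schedule is a placeholder and not the variance scheduler proposed in the paper.

```python
import numpy as np

def make_schedule(T=10, beta_start=1e-4, beta_end=0.5):
    # Placeholder linear schedule (assumption, not the paper's scheduler).
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(y0, t, alpha_bars, rng):
    # Forward process: corrupt clean motion y0 to diffusion level t.
    eps = rng.standard_normal(y0.shape)
    return np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def posterior_step(y_t, y0_hat, t, betas, alphas, alpha_bars, rng):
    # One reverse step y_t -> y_{t-1}, given the predicted clean motion y0_hat.
    if t == 0:
        return y0_hat  # final step: return the prediction itself
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_y0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_yt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_y0 * y0_hat + coef_yt * y_t
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    return mean + np.sqrt(var) * rng.standard_normal(y_t.shape)

def sample(predict_y0, history, shape, T=10, seed=0):
    # Start from Gaussian noise and denoise, conditioning on the history.
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    y_t = rng.standard_normal(shape)
    for t in reversed(range(T)):
        y0_hat = predict_y0(y_t, history, t)  # stand-in for the learned predictor
        y_t = posterior_step(y_t, y0_hat, t, betas, alphas, alpha_bars, rng)
    return y_t

# Usage with a trivial stand-in predictor that repeats the last observed pose.
history = np.zeros((5, 3))
history[-1] = [1.0, 2.0, 3.0]
repeat_last = lambda y_t, x, t: np.repeat(x[-1:], y_t.shape[0], axis=0)
future = sample(repeat_last, history, shape=(4, 3))
print(future.shape)
```

Calling `sample` with different seeds yields different futures from the same history whenever the predictor depends on `y_t`, which is where the stochasticity of the prediction comes from in this kind of setup.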

Paper Structure (45 sections, 16 equations, 8 figures, 10 tables)


Figures (8)

  • Figure 1: Top: Three joint motion trajectories (length 20) whose last 10 values differ: last-observation-padded, noise-padded, and ground-truth sequences. Bottom: Their corresponding DCT values.
  • Figure 2: Architecture of CoMusion's predictor $G_{\theta}(\cdot)$. Inputs include the $t^{th}$ level target noisy motion $y_t$, motion history $x$, and time step $t$. The motion predictor operates in two stages: (1) the Transformer-based motion reconstruction module $F(\cdot)$ initially reconstructs $\tilde{y}_0$ from $y_t$ and $t$, and (2) the GCN-based motion refinement module $R(\cdot)$ then generates the complete motion sequence using the concatenated inputs of $x$ and $\tilde{y}_0$. IDCT stands for Inverse DCT and PE for Positional Encoding.
  • Figure 3: Left: ADE computed at each prediction frame for state-of-the-art methods. Right: CMD computed up to each prediction frame. Both experiments are conducted on the Human3.6M dataset.
  • Figure 4: Qualitative results of CoMusion compared with baseline methods. The upper block of rows shows results on the Human3.6M dataset, and the lower block shows results on the AMASS dataset. The green-purple and blue-orange skeletons denote the observed history and the predictions, respectively.
  • Figure 5: Left: $y_T$, a Gaussian trajectory with $F = 100$ frames. Right: $F(y_T, T)$, the reconstructed trajectory. Compared with $y_T$, $F(y_T, T)$ depicts a much smoother temporal pattern with lower variance.
  • ...and 3 more figures
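
The comparison described in the Figure 1 caption can be reproduced in spirit with a small NumPy sketch: pad a 10-frame history to length 20 either by repeating the last observation or by appending Gaussian noise, then compare how much energy each sequence carries in its higher DCT coefficients. The trajectory values, the `cutoff` index, and the helper names are made up for illustration and are not taken from the paper; the DCT here is a hand-built orthonormal DCT-II matrix.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are frequencies, columns are time steps.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def high_freq_energy(seq, basis, cutoff=5):
    # Sum of squared DCT coefficients at or above the cutoff frequency index.
    coeffs = basis @ seq
    return float(np.sum(coeffs[cutoff:] ** 2))

rng = np.random.default_rng(0)
history = np.sin(np.linspace(0.0, 1.5, 10))  # 10 observed frames (synthetic)

# Two ways to fill the 10 unknown future frames.
last_padded = np.concatenate([history, np.full(10, history[-1])])
noise_padded = np.concatenate([history, rng.standard_normal(10)])

D = dct_matrix(20)
print("last-observation-padded:", high_freq_energy(last_padded, D))
print("noise-padded:           ", high_freq_energy(noise_padded, D))

# The inverse DCT is the transpose of the orthonormal basis.
recovered = D.T @ (D @ last_padded)
```

The smooth, last-observation-padded sequence concentrates its energy in the low-frequency coefficients, while the noise-padded one spreads energy across the spectrum, which is consistent with the motivation for giving a DCT-space refinement module a smooth initialization.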