DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

Hua Yu, Yaqing Hou, Wenbin Pei, Qiang Zhang

TL;DR

DivDiff tackles the one-to-many nature of realistic 3D human motion prediction by conditioning a denoising diffusion model on a learned state embedding of the observed sequence. The embedding, produced by Discrete Cosine Transform (DCT) and Transformer blocks, guides the reverse diffusion process, while a Diversified Reinforcement Sampling Function (DRSF) injects skeletal priors through a DPP-based objective to promote diverse yet plausible motions. Empirical results on Human3.6M and HumanEva-I show that DivDiff achieves state-of-the-art diversity (APD) and competitive accuracy (ADE/FDE, MMADE/MMFDE), outperforming deterministic baselines and prior stochastic methods. The approach offers a scalable, efficient framework for diverse motion synthesis and could extend to related generative tasks such as motion estimation and image synthesis.
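
The metrics named above can be illustrated with a minimal NumPy sketch. It assumes the definitions commonly used in the diverse motion prediction literature (APD as the mean pairwise L2 distance between samples, ADE/FDE as best-of-many errors against one ground-truth future); this section does not state the paper's exact formulas, so the function names and array shapes here are illustrative assumptions. MMADE/MMFDE are omitted because they need a set of grouped multi-modal ground truths.

```python
import numpy as np

def apd(samples):
    """Average Pairwise Distance over S predicted futures.

    samples: array of shape (S, T, D) = (samples, frames, pose dims).
    Assumed definition: mean L2 distance between all distinct sample pairs.
    """
    s = samples.shape[0]
    flat = samples.reshape(s, -1)                  # one vector per sample
    diff = flat[:, None, :] - flat[None, :, :]     # (S, S, T*D)
    dist = np.linalg.norm(diff, axis=-1)           # (S, S), zero diagonal
    return dist.sum() / (s * (s - 1))

def ade(samples, gt):
    """Average Displacement Error: per-frame L2 error averaged over time,
    taken for the sample closest to the ground truth gt of shape (T, D)."""
    per_frame = np.linalg.norm(samples - gt[None], axis=-1)  # (S, T)
    return per_frame.mean(axis=1).min()

def fde(samples, gt):
    """Final Displacement Error: L2 error at the last frame only,
    taken for the sample closest to the ground truth."""
    last = np.linalg.norm(samples[:, -1] - gt[-1], axis=-1)  # (S,)
    return last.min()

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
preds = rng.standard_normal((10, 100, 48))
truth = rng.standard_normal((100, 48))
print(apd(preds), ade(preds, truth), fde(preds, truth))
```

Under these definitions a higher APD means more spread among the samples, while lower ADE and FDE mean that at least one sample lies close to the observed future.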

Abstract

Diverse human motion prediction (HMP) aims to predict multiple plausible future motions given an observed human motion sequence. The task is challenging because the predictions must cover the diversity of potential human motions while still describing future motion accurately. Current solutions either produce low diversity or are limited in expressiveness. Recent denoising diffusion probabilistic models (DDPM) show strong capabilities in generative tasks. However, introducing DDPM directly into diverse HMP raises some issues. Although DDPM can increase the diversity of potential human motion patterns, the predicted motions become implausible over time because of the significant noise disturbances in the forward process of DDPM. As a result, the predicted motions are hard to control, which seriously degrades their quality and restricts their applicability in real-world scenarios. To alleviate this, we propose a novel conditional diffusion-based generative model, called DivDiff, to predict more diverse and realistic human motions. Specifically, DivDiff employs DDPM as its backbone and incorporates the Discrete Cosine Transform (DCT) and transformer mechanisms to encode the observed human motion sequence as a condition that guides the reverse process of DDPM. More importantly, we design a diversified reinforcement sampling function (DRSF) to enforce human skeletal constraints on the predicted human motions. DRSF uses information acquired from the human skeleton as prior knowledge, thereby reducing the significant disturbances introduced during the forward process. Extensive experiments on two widely used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.
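
As a rough illustration of the DCT step mentioned in the abstract, the sketch below applies an orthonormal DCT-II along the time axis of an observed sequence and keeps the lowest-frequency coefficients. The truncation length `keep`, the function names, and the array sizes are assumptions made for this example; the abstract does not specify them, and the transformer that would consume these coefficients is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis of size (n, n); rows are frequencies."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    c[0] /= np.sqrt(2.0)               # rescale the DC row so that C @ C.T = I
    return c

def encode_observation(x, keep):
    """Project a (T, D) sequence onto its first `keep` DCT coefficients
    along time. `keep` is a hypothetical truncation parameter."""
    c = dct_matrix(x.shape[0])
    return c[:keep] @ x                # (keep, D)

def decode(coeffs, n):
    """Map (keep, D) coefficients back to a smooth (n, D) sequence."""
    c = dct_matrix(n)
    return c[:coeffs.shape[0]].T @ coeffs

# Hypothetical sizes: 25 observed frames, 48 pose dimensions.
rng = np.random.default_rng(0)
observed = rng.standard_normal((25, 48))
coeffs = encode_observation(observed, keep=10)
print(coeffs.shape)                    # (10, 48)
```

Keeping only low-frequency coefficients yields a compact, smooth description of each joint trajectory, which is one plausible reason to use the DCT before a sequence encoder.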

Paper Structure

This paper contains 19 sections, 13 equations, 8 figures, 2 tables, and 1 algorithm.

Figures (8)

  • Figure 1: The proposed DivDiff method aims to predict stochastic future human motions (right) given the past human motion sequence (left). For example, if the observed motion is "a person runs straight forward", the potential future motions might be "a person jumps", "a person hops", "a person crawls", and so on. Extensive experiments demonstrate that DivDiff significantly enhances the diversity and fidelity of the predicted human motions.
  • Figure 2: The samples generated by the proposed method cover more patterns (colored ellipses) than those of the CVAE-based method. In the feature space, our method captures a diverse set of future human motion patterns, whereas the CVAE-based method generates many samples that are concentrated on the major patterns of the motion distribution and fails to cover the minor patterns.
  • Figure 3: Illustration of the proposed DivDiff method. The observed sequence $X$ is encoded by the DCT and transformer as a state embedding, which serves as a condition to guide the diffusion process. In the forward process, noise is added to the ground-truth future motion $\boldsymbol{Y}^0$ over $K$ steps until it becomes white noise $\boldsymbol{Y}^K$. In the late stage of the reverse process, the designed DRSF uses a GCN to learn the relationships within the human skeleton, which serve as prior knowledge. Meanwhile, a DPP loss is employed to predict random noise and control the quality of the predicted results.
  • Figure 4: Qualitative comparison between other methods and the proposed DivDiff method. Given the observed motion sequence, the figure shows the end poses of ten future predictions. The proposed DivDiff method yields more diverse and realistic human motions.
  • Figure 5: Results of the proposed DivDiff method with (bottom) and without (top) DRSF. Without DRSF, the predictions exhibit lower diversity than with DRSF.
  • ...and 3 more figures
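
The forward process described in the Figure 3 caption, where the ground-truth future is noised over $K$ steps until it resembles white noise, can be sketched with the standard DDPM closed form $\boldsymbol{Y}^k = \sqrt{\bar\alpha_k}\,\boldsymbol{Y}^0 + \sqrt{1-\bar\alpha_k}\,\epsilon$. The linear noise schedule, its endpoints, the number of steps, and the function names below are assumptions taken from common DDPM practice, not values stated in this section.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(y0, k, alpha_bar, rng):
    """Draw Y^k given Y^0 in one shot (k is a 0-based step index).

    Returns the noised motion and the noise that was added; a denoiser
    would be trained to predict that noise.
    """
    eps = rng.standard_normal(y0.shape)
    yk = np.sqrt(alpha_bar[k]) * y0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return yk, eps

# Hypothetical sizes: 100 future frames, 48 pose dimensions, K = 1000 steps.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
future = rng.standard_normal((100, 48))
early, _ = q_sample(future, 0, alpha_bar, rng)     # still close to Y^0
late, _ = q_sample(future, 999, alpha_bar, rng)    # nearly pure noise
print(np.abs(early - future).mean(), np.abs(late - future).mean())
```

Because the signal weight $\sqrt{\bar\alpha_k}$ shrinks toward zero as $k$ grows, late steps carry almost no information about the original motion, which is consistent with the caption's description of the end state as white noise.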