Realistic Human Motion Generation with Cross-Diffusion Models

Zeping Ren, Shaoli Huang, Xiu Li

TL;DR

CrossDiff addresses the challenge of text-to-motion generation by fusing 3D poses with 2D projections through a unified transformer-based encoder and a cross-diffusion framework. It couples mixed representations via a shared-weights encoder and cross-decoders, enabling four cross-predictions and a two-stage training scheme with mixture sampling that leverages 2D information to improve 3D motion quality. Empirically, CrossDiff achieves competitive state-of-the-art results on HumanML3D and KIT-ML while demonstrating the ability to learn 3D motion from 2D data and perform zero-shot generation with pseudo-labels, thus broadening data applicability. The approach enhances realism by capturing subtle full-body dynamics and offers practical benefits for animation, VR, and robotics where 2D data is plentiful but 3D ground truth is scarce.

Abstract

We introduce the Cross Human Motion Diffusion Model (CrossDiff), a novel approach for generating high-quality human motion based on textual descriptions. Our method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model, unifying motion noise into a single feature space. This enables cross-decoding of features into both 3D and 2D motion representations, regardless of their original dimension. The primary advantage of CrossDiff is its cross-diffusion mechanism, which allows the model to reverse either 2D or 3D noise into clean motion during training. This capability leverages the complementary information in both motion representations, capturing intricate human movement details often missed by models relying solely on 3D information. Consequently, CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences. In our experiments, our model demonstrates competitive state-of-the-art performance on text-to-motion benchmarks. Moreover, our method consistently provides enhanced motion generation quality, capturing complex full-body movement intricacies. Additionally, with a pretrained model, our approach can use in-the-wild 2D motion data without 3D motion ground truth during training to generate 3D motion, highlighting its potential for broader applications and efficient use of available data resources. Project page: https://wonderno.github.io/CrossDiff-webpage/.
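
To make the cross-decoding idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setup in which noisy 3D or 2D motion is projected into one shared feature space and then decoded by either a 3D or a 2D head, which yields the four cross-predictions mentioned in the TL;DR. The joint count, feature width, random linear weights, and the names `encode`, `decode`, and `cross_predictions` are all hypothetical stand-ins for the transformer described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

J, D_FEAT = 22, 64  # hypothetical joint count and shared feature width

# Hypothetical stand-ins for the learned network: one input projection per
# modality, one shared encoder, and one decoder head per output modality.
W_in = {"3D": rng.normal(size=(J * 3, D_FEAT)) * 0.1,
        "2D": rng.normal(size=(J * 2, D_FEAT)) * 0.1}
W_shared = rng.normal(size=(D_FEAT, D_FEAT)) * 0.1
W_out = {"3D": rng.normal(size=(D_FEAT, J * 3)) * 0.1,
         "2D": rng.normal(size=(D_FEAT, J * 2)) * 0.1}


def encode(x_noisy, src):
    """Map noisy motion of either dimension into the shared feature space."""
    return np.tanh(x_noisy @ W_in[src] @ W_shared)


def decode(features, dst):
    """Decode shared features into the requested motion representation."""
    return features @ W_out[dst]


def cross_predictions(x3d_noisy, x2d_noisy):
    """Return all four source -> target predictions, keyed by (src, dst)."""
    noisy = {"3D": x3d_noisy, "2D": x2d_noisy}
    return {(src, dst): decode(encode(noisy[src], src), dst)
            for src in ("3D", "2D") for dst in ("3D", "2D")}


frames = 8
preds = cross_predictions(rng.normal(size=(frames, J * 3)),
                          rng.normal(size=(frames, J * 2)))
for (src, dst), out in preds.items():
    print(f"{src} -> {dst}: {out.shape}")
```

In this sketch the decoder head alone determines the output dimension, so a 2D input can yield a 3D prediction and vice versa; text conditioning and the diffusion time step are omitted for brevity.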

Paper Structure (28 sections, 9 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Our method utilizing the cross-diffusion mechanism (Left) exhibits more full-body details compared to existing methods (Right).
  • Figure 2: Overview of our CrossDiff framework for generating human motion from textual descriptions. The framework incorporates both 3D and 2D motion data, using unified encoding and cross-decoding components to process mixed representations obtained from random projection.
  • Figure 3: Overview of Mixture Sampling. The original noise is sampled from a 2D Gaussian distribution. From time-step $T$ to $\alpha$, CrossDiff predicts the clean 2D motion $\hat{x}_{2D,0}$ and diffuses it back to $x_{2D,t-1}$. In the remaining $\alpha$ steps, CrossDiff denoises in the 3D domain and finally obtains the clean 3D motion.
  • Figure 4: Qualitative results on the HumanML3D dataset. We compare our method with MDM [tevet2022human], T2M-GPT [zhang2023t2m], and MLD [chen2023executing]. We find that our generated actions better convey the intended semantics.
  • Figure 5: (a) The result of the user study. (b) Difference between 3D and 2D motion data distribution. The x-axis shows time and the y-axis shows the normalized joint velocity. The 3D motion is represented by a solid blue line, while the 2D motion is represented by red and green dashed lines, indicating the front and left views, respectively.
  • ...and 1 more figure
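
The mixture sampling procedure described in the Figure 3 caption can be sketched as a simple loop. This is an illustrative sketch under stated assumptions, not the authors' code: the linear noise schedule, the re-noising helper `diffuse`, the callback signature `predict_clean(x, t, src, dst)`, and the placeholder `dummy_predict` are all hypothetical. Only the overall schedule follows the caption: start from 2D Gaussian noise, predict clean 2D motion and diffuse it back for the early steps, then denoise in the 3D domain for the last $\alpha$ steps.

```python
import numpy as np


def mixture_sample(predict_clean, T, alpha, shape_2d, seed=0):
    """Run T reverse steps: 2D domain while t > alpha, 3D for the last alpha."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)  # assumed linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)

    def diffuse(x0, t):
        """Re-noise a clean estimate to step t (t == 0 adds no noise)."""
        if t == 0:
            return x0
        a = alpha_bar[t - 1]
        return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

    x = rng.standard_normal(shape_2d)  # original noise lives in the 2D domain
    domain = "2D"
    trace = []
    for t in range(T, 0, -1):
        target = "2D" if t > alpha else "3D"
        # Predict clean motion in the target domain from the current sample;
        # at t == alpha this is a 2D -> 3D cross-prediction.
        x0_hat = predict_clean(x, t, domain, target)
        x = diffuse(x0_hat, t - 1)
        domain = target
        trace.append(target)
    return x, trace


def dummy_predict(x, t, src, dst):
    """Placeholder for a trained network: zeros with the target-domain shape."""
    joints = 22  # hypothetical joint count
    return np.zeros((x.shape[0], joints * (2 if dst == "2D" else 3)))


motion, trace = mixture_sample(dummy_predict, T=10, alpha=3, shape_2d=(8, 44))
print(motion.shape, trace)
```

With `T=10` and `alpha=3`, the loop spends seven steps in the 2D domain and three in the 3D domain, and the returned array has the 3D shape because the last step decodes into 3D without added noise.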