EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, Jieping Ye

TL;DR

EchoMotion tackles the challenge of synthesizing complex human motion by jointly modeling appearance and kinematics through a Dual-Modality Diffusion Transformer. It introduces Motion-Video Synchronized RoPE (MVS-RoPE) to align video and motion tokens in time and space, and a two-stage training regime that enables joint video-motion generation and cross-modal completion. The HuMoVe dataset provides the large-scale, high-quality paired video, SMPL motion parameters, and captions needed for training such a model. Empirical results show substantial improvements in anatomical plausibility and motion smoothness over video-only baselines, along with versatile cross-modal capabilities, signaling a new direction for kinematically-aware video synthesis.

Abstract

Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
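
The idea of a synchronized 3D coordinate system for video and motion tokens can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how motion tokens are placed spatially, so the choices below (one motion token per latent frame, placed at a spatial slot just outside the video grid, and the per-axis channel split `dims`) are assumptions made purely for illustration. All function names are hypothetical. What the sketch does show is the property named in the abstract: a motion token and the video tokens of the same frame share one temporal index.

```python
import numpy as np

def video_positions(T, H, W):
    """(t, h, w) index for every video latent token, flattened to (T*H*W, 3)."""
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)

def motion_positions(T, H, W):
    """ASSUMPTION: one motion token per latent frame. It reuses the frame's
    temporal index and sits at spatial slot (H, W), just outside the video
    grid, so it never collides with a video token."""
    t = np.arange(T)
    return np.stack([t, np.full(T, H), np.full(T, W)], axis=-1)

def rope_angles(pos, dims=(4, 6, 6), base=10000.0):
    """Rotary angles per axis; each axis owns its own block of channels.
    The split `dims` is an illustrative choice, not taken from the paper."""
    blocks = []
    for axis, d in enumerate(dims):
        freqs = base ** (-np.arange(d // 2) * 2.0 / d)      # shape (d/2,)
        blocks.append(pos[:, axis:axis + 1] * freqs[None, :])
    return np.concatenate(blocks, axis=-1)                  # shape (N, sum(dims)/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Build one shared position table for the concatenated token sequence.
T, H, W = 3, 2, 2
positions = np.concatenate([video_positions(T, H, W), motion_positions(T, H, W)])
tokens = np.random.default_rng(0).normal(size=(len(positions), 16))
rotated = apply_rope(tokens, rope_angles(positions))
```

Because rotary encodings make attention scores depend on relative position, tokens that share a temporal index see no relative rotation on the temporal channels, which is one way to read the "inductive bias that fosters temporal alignment" mentioned above.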

Paper Structure

This paper contains 34 sections, 14 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overview of EchoMotion capabilities: (a) Improving anatomical integrity in human-centric video synthesis and (b) enabling bidirectional control between video and motion. By processing visual and motion sequences within a unified dual-branch Diffusion Transformer, the model learns a joint distribution of human appearance and kinematics.
  • Figure 2: Overview of EchoMotion. (a) The dual-modality DiT block for joint video-motion modeling. (b) MVS-RoPE, which serves as a synchronized coordinate system for the dual-modal token sequence.
  • Figure 3: Overview of our Motion-Video Two-Stage Training Strategy. In Phase 1, the model is pretrained on motion-only data. In Phase 2, we conduct multi-task training on paired motion-video data, regarding "motion-with-video", "motion-to-video", and "video-to-motion" as three distinct tasks to be learned simultaneously.
  • Figure 4: Overview of our HuMoVe dataset. (a) Voronoi treemap of the dataset's composition. (b) Word cloud of the text captions. (c) Sample frames paired with their 3D mesh reconstructions.
  • Figure 5: Qualitative comparison with the 5B baseline. Our model (EchoMotion, right) generates anatomically correct and semantically coherent human motions, resolving the severe artifacts and compositional failures present in the baseline (left).
  • ...and 11 more figures
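
The three Phase 2 tasks named in the Figure 3 caption can be pictured as different choices of which modality is noised during training. The sketch below is a hedged illustration only: the noising rule (linear interpolation toward Gaussian noise), the convention that the conditioning modality is fed in clean, and the loss mask are all assumptions, and the names `TASKS`, `add_noise`, and `make_training_input` are hypothetical. The page does not state the actual schedule, task sampling ratio, or conditioning mechanism.

```python
import numpy as np

TASKS = {
    # task name: (noise the video?, noise the motion?)
    "motion-with-video": (True, True),    # joint generation of both modalities
    "motion-to-video":   (True, False),   # motion is the clean condition
    "video-to-motion":   (False, True),   # video is the clean condition
}

def add_noise(x, t, rng):
    """ASSUMPTION: linear interpolation between data (t=0) and noise (t=1)."""
    return (1.0 - t) * x + t * rng.standard_normal(x.shape)

def make_training_input(video, motion, task, t, rng):
    """Return the (possibly noised) latents plus a mask saying which
    modality the denoising loss should supervise for this task."""
    noise_video, noise_motion = TASKS[task]
    v = add_noise(video, t, rng) if noise_video else video
    m = add_noise(motion, t, rng) if noise_motion else motion
    return v, m, {"video": noise_video, "motion": noise_motion}

# Usage: draw a task uniformly at random for one training example.
rng = np.random.default_rng(0)
video = rng.standard_normal((12, 16))   # toy video latent tokens
motion = rng.standard_normal((3, 16))   # toy motion latent tokens
task = list(TASKS)[rng.integers(len(TASKS))]
v_in, m_in, loss_mask = make_training_input(video, motion, task, t=0.5, rng=rng)
```

Treating the tasks as masks over a shared input lets a single model learn all three simultaneously, which matches the caption's description of multi-task training on paired data.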