MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang

TL;DR

MoSA tackles motion-coherent human video generation from text by decoupling structure from appearance. A 3D structure transformer first synthesizes a motion structure sequence from the text prompt. A diffusion-based backbone then renders appearance conditioned on that structure, augmented by Human-Aware Dynamic Control modules, a dense tracking loss, and a 3D contact constraint that models human–environment interactions. The authors also introduce the MoVid dataset (30K real-world videos with diverse motions) for training and evaluation, and report state-of-the-art results on FVD, CLIPSIM, and VBench metrics, along with strong qualitative results and favorable user preferences. The approach enables fine-grained control over complex body movements and interactions, and the dataset is a resource for future research on realistic human video synthesis.

Abstract

Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To address these challenges, we propose MoSA, which decouples human video generation into two components: structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is improved through the proposed contact constraint. These two components work together to ensure structural and appearance fidelity across the generated videos. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons between MoSA and a variety of approaches, including general video generation models, human video generation models, and human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the majority of evaluation metrics.
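
The two-stage data flow described in the abstract (text to structure sequence, then structure-guided appearance) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: the function names, the joint count, the frame size, and the stub bodies (a random walk for "structure", a joint rasterizer for "appearance") are assumptions for illustration only and do not reflect the paper's 3D structure transformer or its diffusion backbone.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholders: the source does not state a joint count or resolution.
NUM_JOINTS = 24
HEIGHT, WIDTH = 16, 16


@dataclass
class StructureSequence:
    # joints[t][j] is an (x, y, z) tuple with coordinates in [0, 1]
    joints: list


def generate_structure(prompt: str, num_frames: int, seed: int = 0) -> StructureSequence:
    """Stage 1 stand-in: map a text prompt to a 3D motion sequence.

    A seeded random walk replaces the learned model so the sketch is runnable.
    """
    rng = random.Random(f"{seed}:{prompt}")
    pose = [(rng.random(), rng.random(), rng.random()) for _ in range(NUM_JOINTS)]
    frames = []
    for _ in range(num_frames):
        # Small per-frame perturbation keeps consecutive poses close together.
        pose = [
            tuple(min(1.0, max(0.0, c + rng.uniform(-0.02, 0.02))) for c in joint)
            for joint in pose
        ]
        frames.append(pose)
    return StructureSequence(joints=frames)


def render_appearance(prompt: str, structure: StructureSequence) -> list:
    """Stage 2 stand-in: produce one frame per pose, guided by the structure.

    Here "rendering" only marks the pixels that joints project onto.
    """
    video = []
    for pose in structure.joints:
        frame = [[0.0] * WIDTH for _ in range(HEIGHT)]
        for x, y, _z in pose:
            col = min(WIDTH - 1, int(x * WIDTH))
            row = min(HEIGHT - 1, int(y * HEIGHT))
            frame[row][col] = 1.0
        video.append(frame)
    return video


def generate_video(prompt: str, num_frames: int = 8) -> list:
    # Structure is produced first; appearance is conditioned on it afterwards.
    structure = generate_structure(prompt, num_frames)
    return render_appearance(prompt, structure)
```

Calling `generate_video("a person running", num_frames=4)` returns four 16×16 frames. The point of the sketch is only the interface: the appearance stage never sees the prompt alone, it always receives the structure sequence as an explicit input.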

Paper Structure

This paper contains 39 sections, 10 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: Illustration of the motivation. (a) shows sampled frames from videos generated with the prompt "running", where existing works [mochi1, yang2024cogvideox] struggle to generate human videos with reasonable structures. (b) compares existing human video datasets [wang2024humanvid, tiktok] with our MoVid; existing datasets mostly focus on facial or upper-body regions, or consist of vertically oriented dance videos. More samples from MoVid are provided in Fig. \ref{fig:movid sample} and the supplementary materials.
  • Figure 2: Overview of the proposed MoSA. Given a text prompt $p$, we first employ a 3D structure transformer to generate a structure sequence, which is subsequently encoded as structural features to guide the appearance generation. To further enhance motion consistency, we introduce human-aware dynamic control modules. For brevity, the Gate modules in blocks have been omitted.
  • Figure 2: Effect on the Wan2.1 base model.
  • Figure 3: Visual comparison with existing video generation models. For clarity, VideoCrafter2 [chen2024videocrafter2] is denoted as VC2, and HunyuanVideo [kong2024hunyuanvideo] is denoted as Hunyuan.
  • Figure 4: Effect of our decoupling framework MoSA when applied to Wan2.1 [wan2025wan].
  • ...and 12 more figures