MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling
Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang
TL;DR
MoSA tackles the challenge of generating motion-coherent human videos from text by decoupling structure and appearance. It uses a 3D structure transformer to synthesize a motion structure from text, then renders appearance conditioned on this structure with a diffusion-based backbone augmented by Human-Aware Dynamic Control and a dense tracking loss, plus a 3D contact constraint to model human–environment interactions. The authors introduce the MoVid dataset (30K real-world videos with diverse motions) to train and evaluate the method, and demonstrate state-of-the-art performance across FVD, CLIPSIM, and VBench metrics, with strong qualitative results and user preferences. This approach enables fine-grained control over complex body movements and interactions, and provides a valuable dataset for future research in realistic video synthesis of humans.
Abstract
Existing video generation models predominantly emphasize appearance fidelity while exhibiting limited ability to synthesize complex human motions, such as whole-body movements, long-range dynamics, and fine-grained human-environment interactions. This often leads to unrealistic or physically implausible movements with inadequate structural coherence. To address these challenges, we propose MoSA, which decouples human video generation into two components: structure generation and appearance generation. MoSA first employs a 3D structure transformer to generate a human motion sequence from the text prompt. The remaining video appearance is then synthesized under the guidance of this structural sequence. We achieve fine-grained control over the sparse human structures by introducing Human-Aware Dynamic Control modules with a dense tracking constraint during training. The modeling of human-environment interactions is improved through the proposed contact constraint. These two components work together to ensure structural and appearance fidelity across the generated videos. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets. We conduct comprehensive comparisons between MoSA and a variety of approaches, including general video generation models, human video generation models, and human animation models. Experiments demonstrate that MoSA substantially outperforms existing approaches across the majority of evaluation metrics.
