The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

TL;DR

This work tackles the generalization bottleneck in 3D human motion generation by bridging MoCap priors with large-scale ViGen knowledge. It introduces ViMoGen, a flow-matching diffusion transformer that fuses text, MoCap priors, and video-derived semantics via gated cross-modal blocks, plus ViMoGen-light for efficient inference. A large-scale ViMoGen-228K dataset and the novel MBench benchmark enable fine-grained, human-aligned evaluation across motion quality, prompt fidelity, and generalization. Experiments show state-of-the-art performance on generalization and text-motion alignment, with substantial gains from diverse data sources and powerful text encoders, while ViMoGen-light provides a scalable, efficient alternative.

Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

Paper Structure

This paper contains 38 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of our approach toward generalizable 3D human motion generation. (a) ViMoGen: Our model demonstrates superior generalization on challenging prompts including martial arts, dynamic sports, and multi-step behaviors. (b) MBench: Comprehensive benchmark evaluating models across dimensions, showing ViMoGen's significant improvements over existing methods. (c) ViMoGen-228K: Large-scale dataset with 228,000 motion sequences from diverse sources covering simple indoor to complex outdoor activities.
  • Figure 2: Overview of ViMoGen. (a) Our model takes a text prompt as input and leverages both a text encoder and an offline video generation model to produce textual and video motion tokens. These are fused with noisy motion inputs through a stack of gating Diffusion Blocks. (b) Each block includes self-attention, an adaptive gating module, and two cross-attention branches: Text-to-Motion (T2M) and Motion-to-Motion (M2M). Only one branch is activated at a time, enabling the model to adaptively balance robustness and generalization.
  • Figure 3: Overview of MBench. (a) MBench features a more balanced distribution and vastly different prompt designs compared to HumanML3D. (b) MBench is designed to systematically evaluate motion generation algorithms across nine dimensions, focusing on motion quality, prompt-following, and generalization capability.
  • Figure 4: Qualitative comparison on MBench prompts. We show keywords in prompts for simplicity.
  • Figure 5: Human Preference Annotation Interface. Two rendered videos side-by-side with annotator choices.
  • ...and 6 more figures
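
The block structure described in the Figure 2 caption (self-attention followed by exactly one of two cross-attention branches) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention without learned projections, the residual connections, and the boolean `use_video_branch` flag standing in for the adaptive gating module are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention.

    q has shape (Tq, D); k and v have shape (Tk, D).
    Learned projections are omitted to keep the sketch short.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def gated_block(motion, text_tokens, video_motion_tokens, use_video_branch):
    """Hypothetical gated block: self-attention, then ONE cross-attention branch.

    use_video_branch is an assumed stand-in for the adaptive gate:
    False selects the Text-to-Motion (T2M) branch,
    True selects the Motion-to-Motion (M2M) branch.
    """
    # Self-attention over the noisy motion tokens, with a residual connection.
    h = motion + attention(motion, motion, motion)
    # Only one cross-attention branch is active at a time.
    context = video_motion_tokens if use_video_branch else text_tokens
    return h + attention(h, context, context)
```

Calling `gated_block` with arrays of shape `(T, D)` for the motion tokens and `(Tk, D)` for either context returns an array of shape `(T, D)`; switching `use_video_branch` changes only which context the cross-attention reads. How the real gate chooses a branch is not specified in this excerpt, so the flag is left to the caller here.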