Kimodo: Scaling Controllable Human Motion Generation

Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler

Abstract

High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints, including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled by a carefully designed motion representation and a two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how scaling dataset size and model size affects performance.

Paper Structure (41 sections, 1 equation, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Controllable Motion Generation. Kimodo supports flexible and intuitive control for motion generation through text prompting combined with an extensive suite of kinematic constraints. By training on 700 hours of optical mocap data, the model achieves precise control accuracy for a large variety of behaviors. In each example, constrained joints are indicated with a red color, and generated poses at constrained frames are highlighted in yellow. Time progression is indicated by lighter to darker blue coloring.
  • Figure 2: Motion Authoring Demo. (Left) Our authoring interface, built with Viser [yi2025viser], allows intuitive control over Kimodo for motion generation. The timeline panel allows users to specify text prompts and constraints at specific frames or intervals, which are displayed in the 3D viewer. The options panel on the right side of the interface controls various generation parameters. (Right) In editing mode, users have fine-grained control to pose and translate the character at constrained frames. Editing and generation can be done on either the SOMA body skeleton [saito2026soma] or the Unitree G1 robot.
  • Figure 3: Text-to-Motion Results. (Top) Kimodo enables generating high-quality human motions for a variety of behaviors on the SOMA body skeleton. Time progression is indicated by lighter to darker blue coloring. (Middle) Motions can also be generated directly on the G1 robot to easily collect plausible demonstrations. (Bottom) The same frame is visualized from ten different generated motion samples for the same prompt, demonstrating the diversity of Kimodo outputs.
  • Figure 4: Multi-Prompt Generation. Longer motion sequences can be generated from multiple prompts with the demo's timeline interface. Motions are generated sequentially with constraints between them for continuity.
  • Figure 5: Scaling Results. Scaling dataset size, model size, and batch size improves controllability and motion quality. Increased dataset size results in greatly improved constraint following, while model size and batch size are particularly helpful for text following (R-precision) and motion quality (FID). See the scaling table (table:scaling) for full results.
  • ...and 4 more figures