ModSkill: Physical Character Skill Modularization

Yiming Huang, Zhiyang Dou, Lingjie Liu

TL;DR

ModSkill addresses generalization in physical character motion by decoupling full-body skills into modular per-body-part skills, represented as per-part embeddings $z^k_t$ that are produced by a skill modularization attention layer and guide the low-level controller for each body part. It introduces an Active Skill Learning framework with Generative Adaptive Sampling, which uses text-conditioned diffusion models to synthesize diverse, part-specific motion data. The approach improves full-body motion tracking and yields reusable modular skills for downstream tasks such as steering, reaching, and striking, matching or surpassing state-of-the-art baselines without distillation. By combining modularity with generative augmentation, ModSkill scales to large motion datasets and supports flexible task transfer, with potential applications in animation and robotics pipelines.

Abstract

Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Previous methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts. Our framework features a skill modularization attention layer that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We also propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Our results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks.
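To make the skill modularization attention layer more concrete, the following is a minimal sketch, not the authors' implementation. It assumes single-head scaled dot-product attention over per-part states, random untrained projection matrices, and hypothetical sizes (`NUM_PARTS`, `STATE_DIM`, `SKILL_DIM`); the abstract specifies none of these. Each body part attends to all parts, and the result is one skill embedding per part that a per-part low-level controller could consume.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random projection matrix; stands in for learned weights (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def modular_skill_embeddings(part_states, Wq, Wk, Wv):
    """Attention between body parts: one output embedding per part."""
    queries = [matvec(Wq, s) for s in part_states]
    keys = [matvec(Wk, s) for s in part_states]
    values = [matvec(Wv, s) for s in part_states]
    d = len(queries[0])
    embeddings, weights = [], []
    for q in queries:
        # Score this part's query against every part's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of value vectors gives this part's skill embedding.
        z = [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        embeddings.append(z)
    return embeddings, weights


# Hypothetical sizes chosen only for illustration.
NUM_PARTS, STATE_DIM, SKILL_DIM = 5, 8, 4
rng = random.Random(0)
part_states = [[rng.uniform(-1.0, 1.0) for _ in range(STATE_DIM)] for _ in range(NUM_PARTS)]
Wq = random_matrix(SKILL_DIM, STATE_DIM, rng)
Wk = random_matrix(SKILL_DIM, STATE_DIM, rng)
Wv = random_matrix(SKILL_DIM, STATE_DIM, rng)

z, attn = modular_skill_embeddings(part_states, Wq, Wk, Wv)
print(len(z), "embeddings of dimension", len(z[0]))
```

In this sketch every part shares the same projections and state dimension for brevity; a real system would likely use part-specific inputs and trained weights.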

Paper Structure

This paper contains 21 sections, 7 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: We propose a modularized skill learning framework, ModSkill, that decouples full-body motion into skill embeddings for controlling individual body parts. Learned from large-scale motion datasets, these modular skills can be combined to control a simulated character to perform diverse motions, such as the Usain Bolt pose, and seamlessly reused for various downstream tasks.
  • Figure 2: Left: We extract modular skills from a large-scale motion dataset using a motion imitation objective, enabling low-level controllers to control various body parts of a physically simulated character. Active skill learning, through adaptive sampling from an off-the-shelf motion generation model, further enhances policy performance. Right: The learned modular skills can be transferred to downstream tasks by freezing the low-level controllers and training a high-level policy with task-specific rewards.
  • Figure 3: Skill Modularization Attention Layer: Given partial states for each body part, attention between body parts produces modular skill embeddings.
  • Figure 4: Generative Adaptive Sampling: When generating new samples for a reference motion (left), such as "a person kicks their left leg," the synthesized full-body motion (middle) introduces diverse variations from the original sequence. In contrast, synthesized motion for a specific body part (right) captures subtle differences, such as knee angles. The top row shows the target motion sequence, while the bottom row displays the imitated motion produced by our policy network. Red spheres indicate the corresponding target joint locations.
  • Figure 5: Our modular skill embeddings are flexible and informative, achieving natural human-like behavior in downstream tasks.
  • ...and 5 more figures
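
The adaptive sampling idea described for Figures 2 and 4 can be sketched as follows. This is an illustrative guess at the control flow, not the paper's algorithm: the failure-proportional weighting, the `floor` and `hard_threshold` parameters, and the `generate_variation` stub (a stand-in for an off-the-shelf text-conditioned motion generator) are all assumptions introduced here.

```python
import random


def sampling_weights(failure_rates, floor=0.05):
    """Turn per-motion failure rates into sampling probabilities.

    The floor keeps well-tracked motions from vanishing entirely (assumption).
    """
    raw = [max(f, floor) for f in failure_rates]
    total = sum(raw)
    return [r / total for r in raw]


def generate_variation(text, rng):
    """Hypothetical stub for a text-conditioned motion generator."""
    return f"{text} (synthetic variation {rng.randint(0, 999)})"


def build_batch(captions, failure_rates, batch_size, hard_threshold, rng):
    """Sample motions, replacing hard ones with generated variations."""
    weights = sampling_weights(failure_rates)
    picks = rng.choices(range(len(captions)), weights=weights, k=batch_size)
    batch = []
    for i in picks:
        if failure_rates[i] >= hard_threshold:
            # Hard-to-track motion: ask the generator for a new sample.
            batch.append(("generated", generate_variation(captions[i], rng)))
        else:
            # Easy motion: reuse the original dataset clip.
            batch.append(("dataset", captions[i]))
    return batch


# Illustrative data; captions and failure rates are made up.
captions = ["a person walks forward", "a person waves", "a person kicks their left leg"]
failure_rates = [0.0, 0.1, 0.8]
rng = random.Random(0)
weights = sampling_weights(failure_rates)
batch = build_batch(captions, failure_rates, batch_size=8, hard_threshold=0.5, rng=rng)
for source, description in batch:
    print(source, "|", description)
```

Under these assumptions, motions the policy tracks poorly are drawn more often and are the ones that trigger generation, which mirrors the stated goal of enhancing learning in challenging tracking scenarios.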