ModSkill: Physical Character Skill Modularization
Yiming Huang, Zhiyang Dou, Lingjie Liu
TL;DR
ModSkill improves generalization in physics-based character control by decoupling full-body skills into modular per-body-part skills, represented as per-part embeddings $z^k_t$ that are produced by a skill modularization attention layer and guide a low-level controller for each body part. It introduces an Active Skill Learning framework with Generative Adaptive Sampling, which uses diffusion models conditioned on text descriptions to synthesize diverse, part-specific motion data. The approach improves full-body motion tracking and yields reusable modular skills for downstream tasks such as steering, reaching, and striking, matching or surpassing state-of-the-art baselines without the need for distillation. By combining modularity with generative augmentation, ModSkill scales to large motion datasets and supports flexible task transfer, which is relevant to animation and robotics pipelines.
Abstract
Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Previous methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts. Our framework features a skill modularization attention layer that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We also propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Our results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks.
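As a rough illustration of the idea described above, the following minimal sketch shows one way an attention layer could turn policy observations into one skill embedding $z^k_t$ per body part, with each embedding driving its own low-level controller. Everything here is an assumption made for illustration and is not taken from the ModSkill implementation: the class names `ModularSkillAttention` and `PartController`, the use of one query vector per body part, the split of the observation into tokens, the linear-plus-tanh controllers, the random (untrained) weights, and the choice of five body parts with the action sizes shown.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ModularSkillAttention:
    """Hypothetical layer: one query per body part attends over observation tokens."""

    def __init__(self, num_parts, token_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.queries = rand_matrix(num_parts, embed_dim, rng)  # one query per part
        self.W_k = rand_matrix(embed_dim, token_dim, rng)      # key projection
        self.W_v = rand_matrix(embed_dim, token_dim, rng)      # value projection
        self.scale = 1.0 / math.sqrt(embed_dim)

    def __call__(self, tokens):
        keys = [matvec(self.W_k, t) for t in tokens]
        values = [matvec(self.W_v, t) for t in tokens]
        embeddings = []
        for q in self.queries:
            # Scaled dot-product attention weights of this part over all tokens.
            weights = softmax([dot(q, k) * self.scale for k in keys])
            # Per-part skill embedding: attention-weighted sum of the values.
            z = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(q))]
            embeddings.append(z)
        return embeddings


class PartController:
    """Hypothetical low-level controller: maps one part's embedding to its actions."""

    def __init__(self, embed_dim, action_dim, rng):
        self.W = rand_matrix(action_dim, embed_dim, rng)

    def __call__(self, z):
        # tanh keeps each action component inside (-1, 1).
        return [math.tanh(a) for a in matvec(self.W, z)]


def full_body_action(attention, controllers, tokens):
    """Compute per-part embeddings, then concatenate the per-part actions."""
    embeddings = attention(tokens)
    action = []
    for controller, z in zip(controllers, embeddings):
        action.extend(controller(z))
    return embeddings, action


if __name__ == "__main__":
    # Illustrative sizes only: 5 parts, 3 observation tokens of dimension 4.
    attention = ModularSkillAttention(num_parts=5, token_dim=4, embed_dim=8)
    rng = random.Random(1)
    controllers = [PartController(8, n, rng) for n in (3, 3, 6, 6, 5)]
    tokens = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.5, -0.5, 0.2], [0.0, 0.0, 1.0, -1.0]]
    embeddings, action = full_body_action(attention, controllers, tokens)
    print(len(embeddings), len(action))
```

The point of the sketch is the interface rather than the details: because each body part receives its own embedding and its own controller, a downstream task policy could in principle reuse or recombine the per-part skills instead of relearning a single full-body latent.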
