Perpetual Humanoid Control for Real-time Simulated Avatars
Zhengyi Luo, Jinkun Cao, Alexander Winkler, Kris Kitani, Weipeng Xu
TL;DR
The paper introduces the Perpetual Humanoid Controller (PHC), a physics-based motion imitator that drives real-time avatars without resets and tolerates noisy inputs. Its core is the Progressive Multiplicative Control Policy (PMCP), which adds network capacity in stages: primitives are trained progressively on harder and harder motion sequences, and a composer then fuses them. This lets PHC scale imitation to the AMASS dataset and add fail-state recovery without catastrophic forgetting. The approach uses an Adversarial Motion Prior (AMP) to keep motion natural and human-like, and it accepts input from video-based pose estimators or language-based motion generators, including a keypoint-based variant that reduces reliance on joint rotations. PHC achieves state-of-the-art imitation performance (up to 98.9% success on MoCap data) and demonstrates robust real-time avatar control from video or language prompts, with reliable recovery from falls and detours. The work offers a practical path to perpetual, physically grounded avatars for telepresence, gaming, and embodied AI, and it outlines future directions in tighter pose-estimator integration and terrain-aware interaction.
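To make the "composer that fuses them" idea concrete, here is a minimal sketch of one common way to fuse action proposals multiplicatively: a weighted product of diagonal Gaussians. This is an illustration, not the authors' implementation. The summary does not give the exact fusion rule, so the function name `compose_gaussian_primitives`, the Gaussian form of each primitive, and the weight vector are all assumptions.

```python
from typing import List, Sequence, Tuple


def compose_gaussian_primitives(
    means: Sequence[Sequence[float]],
    stds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse k diagonal-Gaussian action proposals by a weighted product.

    means[i][j], stds[i][j]: mean and std of (hypothetical) primitive i
    for action dimension j. weights[i]: non-negative composer weight.
    The product of N(mu_i, sigma_i^2)^{w_i} is again Gaussian, with
    precision sum_i w_i / sigma_i^2 and a precision-weighted mean.
    """
    if not (len(means) == len(stds) == len(weights)):
        raise ValueError("means, stds and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")

    out_mean: List[float] = []
    out_std: List[float] = []
    for j in range(len(means[0])):
        # Each primitive contributes precision in proportion to its weight.
        precision = sum(w / (s[j] ** 2) for w, s in zip(weights, stds))
        weighted_sum = sum(
            w * m[j] / (s[j] ** 2) for w, m, s in zip(weights, means, stds)
        )
        out_mean.append(weighted_sum / precision)
        out_std.append(precision ** -0.5)
    return out_mean, out_std


# Two primitives proposing different actions for a 1-D action space.
mean, std = compose_gaussian_primitives(
    means=[[0.0], [2.0]], stds=[[1.0], [1.0]], weights=[1.0, 1.0]
)
```

With equal weights and equal variances the fused mean sits halfway between the two proposals, and the fused distribution is narrower than either primitive. A weight of zero removes a primitive from the product entirely.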
Abstract
We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.
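The staged allocation of capacity described in the abstract can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the paper's training code: `train_progressively`, `toy_trainer`, and the `difficulty` table are hypothetical, and selecting the "harder" sequences as the clips the latest primitive fails on is an assumption made here for concreteness.

```python
from typing import Callable, List, Sequence, Tuple

Clip = str
# A primitive is modeled as a predicate: True if it imitates the clip.
Primitive = Callable[[Clip], bool]


def train_progressively(
    clips: Sequence[Clip],
    train_primitive: Callable[[Sequence[Clip]], Primitive],
    max_primitives: int,
) -> Tuple[List[Primitive], List[Clip]]:
    """Add one primitive per stage, each trained on the clips still failing."""
    primitives: List[Primitive] = []
    remaining = list(clips)
    while remaining and len(primitives) < max_primitives:
        # New capacity for the current hard set; earlier primitives are
        # left untouched, so what they learned is not overwritten.
        primitive = train_primitive(remaining)
        primitives.append(primitive)
        failed = [c for c in remaining if not primitive(c)]
        if len(failed) == len(remaining):
            break  # no progress, stop adding capacity
        remaining = failed
    return primitives, remaining


# Hypothetical difficulty scores standing in for real motion clips.
difficulty = {"walk": 1, "run": 2, "cartwheel": 5, "backflip": 8}


def toy_trainer(subset: Sequence[Clip]) -> Primitive:
    # Stand-in for RL training: the primitive masters every clip at or
    # below the upper-median difficulty of its training subset.
    levels = sorted(difficulty[c] for c in subset)
    cutoff = levels[len(levels) // 2]
    return lambda clip: difficulty[clip] <= cutoff


primitives, unsolved = train_progressively(list(difficulty), toy_trainer, 4)
```

In this toy run the first primitive handles the three easier clips and a second primitive is allocated only for the remaining one. A further task such as fail-state recovery could be added in the same way, as one more primitive, before a composer is trained over the full set.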
