Versatile Physics-based Character Control with Hybrid Latent Representation
Jinseok Bae, Jungdam Won, Donggeun Lim, Inwoo Hwang, Young Min Kim
TL;DR
The paper addresses versatile, physics-based character control under temporally or spatially sparse goals by introducing a hybrid latent representation that combines a discrete motion prior with a continuous residual. It employs Residual Vector Quantization (RVQ) to maximize code usage and sampling efficiency, while high-level policies adjust only small residuals, which improves stability and training efficiency. The authors demonstrate three downstream tasks (motion in-betweening, head-mounted device tracking, and point-goal navigation) and report superior motion fidelity, naturalness, and robustness compared to continuous or discrete baselines, supported by ablations on RVQ and the number of codebooks. The approach offers a scalable framework for reusable motion priors in complex control settings and points to future extensions to multi-subgoal scenarios and model-based planning strategies.
Abstract
We present a versatile latent representation that enables a physically simulated character to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model to capture a distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior, but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
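As a rough illustration of the idea described in the abstract, the following is a minimal NumPy sketch of greedy Residual Vector Quantization followed by a small continuous correction. It is not the authors' implementation: the function names `rvq_encode` and `hybrid_latent`, the randomly initialized codebooks, and the `tanh`-bounded residual with a fixed `scale` are all assumptions made for illustration only.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Greedy residual vector quantization (illustrative sketch).

    z: (d,) latent vector; codebooks: list of (K, d) arrays, one per stage.
    Each stage quantizes whatever the previous stages left unexplained.
    """
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        # Pick the code nearest to the current residual.
        dists = np.sum((cb - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        quantized += cb[k]
        residual -= cb[k]
    return indices, quantized

def hybrid_latent(indices, codebooks, continuous_residual, scale=0.1):
    """Discrete code sum plus a bounded continuous offset.

    The tanh bound and `scale` are assumptions, chosen so the continuous
    part can only make a small adjustment around the discrete code.
    """
    discrete = sum(cb[k] for cb, k in zip(codebooks, indices))
    return discrete + scale * np.tanh(continuous_residual)

rng = np.random.default_rng(0)
d, K, num_stages = 8, 16, 3
# Hypothetical codebooks; later stages use a smaller scale to model finer detail.
codebooks = [rng.normal(size=(K, d)) * (0.5 ** s) for s in range(num_stages)]

z = rng.normal(size=d)
indices, z_q = rvq_encode(z, codebooks)
z_hybrid = hybrid_latent(indices, codebooks, rng.normal(size=d))
```

In this sketch a latent is described by one index per codebook, so a policy choosing among codes faces `num_stages` small categorical choices rather than one very large one, and the bounded offset keeps the final latent within `scale` of the selected discrete code in every dimension.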
