MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting

Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, Xue Bin Peng

TL;DR

MaskedMimic introduces a unified physics-based character controller that learns to generate full-body motions from partial, multi-modal constraints by formulating control as motion inpainting. Training proceeds in two stages: RL-based, fully-constrained motion tracking, followed by distillation into a partially-constrained, CVAE-based prior. The resulting model supports inference from partial goals such as keyframes, text, or object interactions. The approach achieves robust performance across full-body tracking, VR-style sparse joint tracking, irregular terrains, and diverse object interactions, with a flexible goal-engineering interface that enables new tasks without task-specific reward design. By unifying control under a single model and leveraging multi-modal data, MaskedMimic simplifies the animation pipeline while maintaining physical plausibility and adaptability to complex scenes.

Abstract

Crafting a single, versatile physics-based controller that can breathe life into interactive characters across a wide spectrum of scenarios represents an exciting frontier in character animation. An ideal controller should support diverse control modalities, such as sparse target keyframes, text instructions, and scene information. While previous works have proposed physically simulated, scene-aware control models, these systems have predominantly focused on developing controllers that each specialize in a narrow set of tasks and control modalities. This work presents MaskedMimic, a novel approach that formulates physics-based character control as a general motion inpainting problem. Our key insight is to train a single unified model to synthesize motions from partial (masked) motion descriptions, such as masked keyframes, objects, text descriptions, or any combination thereof. This is achieved by leveraging motion tracking data and designing a scalable training method that can effectively utilize diverse motion descriptions to produce coherent animations. Through this process, our approach learns a physics-based controller that provides an intuitive control interface without requiring tedious reward engineering for all behaviors of interest. The resulting controller supports a wide range of control modalities and enables seamless transitions between disparate tasks. By unifying character control through motion inpainting, MaskedMimic creates versatile virtual characters. These characters can dynamically adapt to complex scenes and compose diverse motions on demand, enabling more interactive and immersive experiences.

Paper Structure

This paper contains 69 sections, 11 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Partial motion plans. MaskedMimic synthesizes full-body physics-based character animations. It achieves this by inpainting conditioned on multi-modal partial objectives. (a) The character climbs up a hill by tracking target head coordinates. (b) Text-to-motion synthesis enables the character to perform a waving motion. (c) The character navigates across irregular terrain by combining head-tracking with text-based style conditioning. (d) Interacting with a goal object, in this case sitting on an armchair, is achieved by conditioning on the object.
  • Figure 2: The MaskedMimic framework: The first phase produces a fully-constrained controller $\pi^\text{FC}$. This full-body tracker is trained using reinforcement learning to imitate kinematic motion recordings across a wide range of complex scene-aware contexts. The second phase produces MaskedMimic. Treating $\pi^\text{FC}$ as a teacher, its knowledge is distilled through supervised imitation learning into a partially-constrained controller $\pi^\text{PC}$. Because $\pi^\text{PC}$ observes masked inputs, this process enables it to perform physics-based inpainting. Finally, at inference, without any further training, $\pi^\text{PC}$ is used to generate novel motions, in previously unseen scenes, from partial goals provided by the user.
  • Figure 3: Training scene (screenshot): The top region consists of standard flat terrain, enabling the controller to reproduce the original motions in a setting that best represents how they were recorded. The central region contains irregular terrain with stairs, slopes, and rough surfaces, allowing the controller to learn robust motion skills on varied ground geometries. The bottom region is reserved exclusively for object interactions, ensuring that the agent can practice interacting with objects in a clean and reproducible setup without interference from irregular terrain features.
  • Figure 4: MaskedMimic VAE Architecture.
  • Figure 5: Motion tracking: MaskedMimic generates full-body motion when tracking signals extracted from unseen kinematic motions. The examples show precise fighting and dancing moves when tracking full-body information, a cartwheel from VR signals, and running by tracking the head (path following). The green spheres represent the target joint positions in each frame.
  • ...and 4 more figures
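
The two-phase idea in the Figure 2 caption (a teacher that sees the full target motion, and a student that sees only a masked version of it and is trained to match the teacher's actions) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-joint Bernoulli masking, the `keep_prob` parameter, the choice to append the mask to the observation, the mean-squared-error loss, and the linear stand-ins for the two policies. The actual $\pi^\text{PC}$ is described as a CVAE-based model, which this sketch does not attempt to reproduce.

```python
import numpy as np


def sample_joint_mask(num_joints, keep_prob, rng):
    """Boolean mask over joints; True means the joint target stays visible.

    Hypothetical masking scheme: each joint is kept independently.
    """
    return rng.random(num_joints) < keep_prob


def build_partial_observation(joint_targets, mask):
    """Hide masked-out joint targets and append the mask itself.

    joint_targets: (num_joints, 3) array of target positions.
    mask:          (num_joints,) boolean array.
    Appending the mask lets a policy tell "hidden" apart from
    "target located at the origin".
    """
    visible = np.where(mask[:, None], joint_targets, 0.0)
    return np.concatenate([visible.ravel(), mask.astype(joint_targets.dtype)])


def distillation_loss(student_action, teacher_action):
    """Mean squared error between student and teacher actions (assumed loss)."""
    return float(np.mean((student_action - teacher_action) ** 2))


rng = np.random.default_rng(0)
num_joints, action_dim = 5, 4

joint_targets = rng.normal(size=(num_joints, 3))
mask = sample_joint_mask(num_joints, keep_prob=0.5, rng=rng)

# Toy linear "policies" standing in for the real networks.
teacher_weights = rng.normal(size=(action_dim, num_joints * 3))
student_weights = np.zeros((action_dim, num_joints * 4))

# The teacher acts on the full targets; the student only sees the masked input.
teacher_action = teacher_weights @ joint_targets.ravel()
partial_obs = build_partial_observation(joint_targets, mask)
student_action = student_weights @ partial_obs

# A training loop would minimize this quantity with respect to the student.
loss = distillation_loss(student_action, teacher_action)
```

In this toy setup, tracking only the head (as in Figure 1a) would correspond to a mask with a single True entry, while full-body tracking would correspond to an all-True mask.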