Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

Jianan Li, Xiao Chen, Tao Huang, Tien-Tsin Wong

TL;DR

The paper tackles the challenge of learning realistic 3D character control from videos without any 3D motion data. It introduces Mimic2DM, a framework that learns a view-agnostic 2D motion tracking policy by minimizing reprojection error under physics constraints, and couples this policy with a 2D autoregressive motion generator that synthesizes 2D references in a hierarchical setup. Key innovations include a multi-view aggregation approach to mitigate depth ambiguity, adaptive state initialization for efficient RL, and a canonical 2D motion representation with a VQ-VAE tokenizer to enable scalable generation. Experiments on human-object interaction (HOI) and animal motion demonstrate strong 3D tracking performance and competitive generation quality, highlighting the method’s applicability to in-the-wild videos and diverse characters without 3D supervision.
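The tokenizer step mentioned above can be illustrated with the core operation of any VQ-VAE: replacing each continuous latent vector with the index of its nearest codebook entry. This is a minimal sketch, not the paper's implementation. The encoder that would produce the latents is omitted, and the names `quantize`, `dequantize`, and the toy 2D codebook are hypothetical.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Recover the quantized latent vectors from discrete token indices."""
    return [codebook[t] for t in tokens]


# Toy codebook with three 2D entries (hypothetical values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Stand-ins for encoder outputs of three motion frames (assumed, not real data).
latents = [(0.9, 0.1), (0.1, 0.8), (0.1, -0.1)]

tokens = quantize(latents, codebook)
recovered = dequantize(tokens, codebook)
print(tokens, recovered)
```

The resulting integer tokens are what an autoregressive generator would predict one step at a time. Decoding them back into 2D motion would require the matching decoder, which is not shown here.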

Abstract

Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/
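To make the reprojection idea concrete, the sketch below projects 3D joint positions into a camera view and measures the distance to 2D reference keypoints. It also shows why a single view is ambiguous: two different 3D poses can produce the same 2D projection, while a second viewpoint tells them apart. This is an illustration under stated assumptions, not the paper's reward or aggregation scheme. The pinhole camera orbiting the vertical axis, the `focal` and `cam_dist` parameters, and all function names are hypothetical.

```python
import math


def project(point, cam_yaw, focal=1.0, cam_dist=5.0):
    """Pinhole projection for a camera orbiting the vertical y-axis (assumed setup)."""
    x, y, z = point
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    # Rotate the world point about the y-axis into the camera frame.
    xc = c * x - s * z
    zc = s * x + c * z + cam_dist  # depth along the optical axis
    return (focal * xc / zc, focal * y / zc)


def reprojection_error(joints_3d, ref_2d, cam_yaw):
    """Mean Euclidean distance between projected joints and 2D reference keypoints."""
    total = 0.0
    for p, (u_ref, v_ref) in zip(joints_3d, ref_2d):
        u, v = project(p, cam_yaw)
        total += math.hypot(u - u_ref, v - v_ref)
    return total / len(joints_3d)


# Two single-joint "poses" that lie on the same ray of the front camera.
pose_a = [(0.5, 1.0, 0.0)]
pose_b = [(0.6, 1.2, 1.0)]
front, side = 0.0, math.pi / 2

# 2D references generated from pose_a, as seen from each viewpoint.
ref_front = [project(p, front) for p in pose_a]
ref_side = [project(p, side) for p in pose_a]

err_front_b = reprojection_error(pose_b, ref_front, front)  # near zero: ambiguous
err_side_b = reprojection_error(pose_b, ref_side, side)     # clearly nonzero
print(err_front_b, err_side_b)
```

In a tracking setup, an error of this kind would typically be turned into a reward for the simulated character. How the policy combines views at inference time is specific to the paper and is not reproduced here.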

Paper Structure

This paper contains 43 sections, 9 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: The proposed Mimic2DM effectively learns character controllers for diverse motion types, including dynamic human dancing, complex ball interactions, and agile animal movements, by directly imitating 2D motion sequences extracted from in-the-wild videos.
  • Figure 2: Ambiguity caused by the 3D-to-2D gap. When a 2D tracking controller receives only single-view 2D motion as input, it cannot tell apart different 3D poses that share the same projection. As illustrated, the orange and grey characters both minimize the reprojection error for the given 2D reference.
  • Figure 3: Overview of the pipeline. Our approach Mimic2DM learns a view-agnostic tracking policy that imitates 2D motion sequences extracted from videos. The resulting policy can be zero-shot adapted to multi-view tracking via view aggregation, and integrated with a 2D motion generator for generative tasks.
  • Figure 4: Qualitative comparison between our method (bottom) and the adapted Sfv* [peng2018sfv] (middle) on imitating 2D reference motions (top) from Soccer Dribble and Animal. Sfv* is adapted to learn from 3D motions reconstructed via reprojection loss minimization. It struggles to reproduce complex ball–foot interactions (left) and often yields unnatural motions (right) due to inaccuracies in the reconstructed 3D training data.
  • Figure 5: Visualization of synthesized ball-dribbling motions using our hierarchical controller. Our framework learns diverse skills from 2D motion data, exhibits natural transitions between different dribbling styles (left), and demonstrates seamless composition and switching between directional movements.
  • ...and 4 more figures