PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction

Qiao Feng, Yiming Huang, Yufu Wang, Jiatao Gu, Lingjie Liu

TL;DR

PhysHMR presents a unified framework that reconstructs physically plausible humanoid motion from monocular video by directly mapping visual input to actions executed in a physics-based simulator. It couples local vision features from a pretrained encoder with a soft global grounding mechanism called pixel-as-ray, avoiding reliance on noisy 3D root estimates. A distillation strategy from a mocap-trained expert accelerates learning and stabilizes policy optimization via PPO, while a composite reward enforces imitation, realism, and smoothness. Experiments on multiple datasets show PhysHMR achieves high visual fidelity and superior physical plausibility, outperforming two-stage baselines in key realism metrics and user perception. The approach enables more reliable simulation-ready motion and has implications for robotics, animation, and embodied AI, though it remains offline due to reliance on a pretrained encoder.

Abstract

Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
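The pixel-as-ray idea in the abstract, lifting a 2D keypoint into a 3D ray and expressing it in global space, can be illustrated with a standard pinhole back-projection. The following is a minimal sketch, not the authors' implementation: the function name `pixel_to_world_ray`, the intrinsics `fx, fy, cx, cy`, the camera-to-world rotation `R_c2w`, and the camera center `cam_center` are all assumptions introduced for illustration, and the paper's exact ray parameterization may differ.

```python
import math

def pixel_to_world_ray(u, v, fx, fy, cx, cy, R_c2w, cam_center):
    """Back-project pixel (u, v) to a unit-length ray in world coordinates.

    Assumes a pinhole camera looking along +z, with R_c2w a 3x3
    camera-to-world rotation (nested lists) and cam_center the camera
    position in world coordinates. All names here are hypothetical.
    """
    # Ray direction in the camera frame, before normalization.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    norm = math.sqrt(sum(c * c for c in d_cam))
    d_cam = [c / norm for c in d_cam]
    # Rotate the direction into the world frame.
    d_world = [sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Every ray from this camera starts at the camera center.
    return list(cam_center), d_world

# Example: identity rotation, camera 1.5 units along the world z-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_to_world_ray(
    820.0, 240.0, 500.0, 500.0, 320.0, 240.0, identity, [0.0, 0.0, 1.5]
)
```

A ray constrains the keypoint to a line rather than a point, which is one way to read the abstract's "soft global grounding": the policy receives a global cue without committing to a single, possibly noisy, 3D root estimate.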

Paper Structure

This paper contains 26 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of our pipeline. A visual-to-action policy reconstructs physically plausible motion from monocular videos. Training efficiency is improved by combining reinforcement learning and knowledge distillation. Global motion is guided using a pixel-as-ray module that lifts 2D keypoints into 3D rays.
  • Figure 2: Comparison against two physics-based methods. The black line indicates the ground. PhysPT (row 2) uses neural networks to approximate physics, but still suffers from ground penetration. PHC+ (row 3) amplifies motion reconstruction errors during tracking, leading to unstable results. Neither method can correct upstream errors. In contrast, our visual-to-action approach produces motion that is both physically plausible and visually aligned.
  • Figure 3: Mean reward curves during training. PPO Only converges slowly and underperforms. Distillation Only converges quickly but plateaus early. Our approach (PPO + Distillation) achieves both faster convergence and higher final rewards.
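
The "PPO + Distillation" setting compared in Figure 3 can be pictured as a single training objective that adds an expert-imitation term to the PPO loss. The sketch below is one common way to combine the two and is an assumption, not the paper's stated formulation: the clipped surrogate is the standard PPO objective, while the mean-squared-error distillation term, the weight `distill_weight`, and all function names are hypothetical.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def distillation_mse(student_actions, expert_actions):
    """Mean squared error between student and expert actions (assumed form)."""
    total, count = 0.0, 0
    for s_vec, e_vec in zip(student_actions, expert_actions):
        for s, e in zip(s_vec, e_vec):
            total += (s - e) ** 2
            count += 1
    return total / count

def combined_loss(ratios, advantages, student_actions, expert_actions,
                  distill_weight=1.0, eps=0.2):
    """Loss to minimize: negated PPO surrogate plus weighted distillation."""
    surrogate = sum(
        ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)
    ) / len(ratios)
    distill = distillation_mse(student_actions, expert_actions)
    return -surrogate + distill_weight * distill

# Example with one sample and a two-dimensional action.
loss = combined_loss([1.5], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
```

Under this reading, the distillation term supplies a dense learning signal early in training, while the PPO term lets the policy keep improving on the physics-based rewards, consistent with the curves described in the caption.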