PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos

Yufei Zhang; Jeffrey O. Kephart; Zijun Cui; Qiang Ji

TL;DR

PhysPT addresses the problem of physically implausible monocular 3D human motion estimates by combining a self-supervised Transformer with a physics-aware body model and a continuous contact force model. It introduces Phys-SMPL to extract differentiable mass and inertia from SMPL geometry, and derives Euler-Lagrange-based losses to enforce dynamics during training. The approach refines motion estimates and infers forces without relying on 3D force labels or physics engines, and it improves downstream human action recognition when the inferred forces are incorporated. The framework shows gains in physical plausibility, robustness to occlusion, and compatibility with various kinematics-based backbones, which makes it practical for real-world motion capture from monocular video.
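To make the idea of reading physical properties off body geometry concrete, here is a minimal sketch of how the volume and mass of one body part could be computed from a closed triangle mesh. It is an illustration under stated assumptions, not the Phys-SMPL implementation: the function names `mesh_volume` and `part_mass` are hypothetical, the mesh is assumed to be closed and consistently oriented, and the uniform density of 1000 kg/m^3 is an illustrative placeholder.

```python
import numpy as np


def mesh_volume(vertices, faces):
    """Signed volume of a closed, consistently oriented triangle mesh.

    vertices: (V, 3) float array; faces: (F, 3) integer array of vertex indices.
    """
    v0 = vertices[faces[:, 0]]
    v1 = vertices[faces[:, 1]]
    v2 = vertices[faces[:, 2]]
    # Each triangle forms a tetrahedron with the origin whose signed volume is
    # dot(v0, cross(v1, v2)) / 6; for a closed mesh these sum to the enclosed volume.
    return np.einsum("ij,ij->i", v0, np.cross(v1, v2)).sum() / 6.0


def part_mass(vertices, faces, density=1000.0):
    """Mass of a body part, assuming a uniform density (hypothetical value)."""
    return density * abs(mesh_volume(vertices, faces))


# Usage: a unit right tetrahedron stands in for a body-part mesh.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tris = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
print(mesh_volume(verts, tris), part_mass(verts, tris))
```

Because the volume is a sum of products of vertex coordinates, it stays differentiable with respect to the vertices when written in an autodiff framework, which is the property the summary attributes to Phys-SMPL.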

Abstract

While current methods have shown promising progress on estimating 3D human motion from monocular videos, their motion estimates are often physically unrealistic because they mainly consider kinematics. In this paper, we introduce Physics-aware Pretrained Transformer (PhysPT), which improves kinematics-based motion estimates and infers motion forces. PhysPT exploits a Transformer encoder-decoder backbone to effectively learn human dynamics in a self-supervised manner. Moreover, it incorporates physics principles governing human motion. Specifically, we build a physics-based body representation and contact force model. We leverage them to impose novel physics-inspired training losses (i.e., force loss, contact loss, and Euler-Lagrange loss), enabling PhysPT to capture physical properties of the human body and the forces it experiences. Experiments demonstrate that, once trained, PhysPT can be directly applied to kinematics-based estimates to significantly enhance their physical plausibility and generate favourable motion forces. Furthermore, we show that these physically meaningful quantities translate into improved accuracy of an important downstream task: human action recognition.
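For orientation, the Euler-Lagrange loss mentioned above builds on the equations of motion of an articulated body. The form below is the standard textbook one for a rigid-body system in contact; the symbols are assumptions chosen for illustration and may not match the paper's notation: $\mathbf{q}$ for generalized coordinates, $\mathbf{M}$ for the generalized mass matrix, $\mathbf{C}$ for Coriolis and centrifugal terms, $\mathbf{g}$ for gravity, $\boldsymbol{\tau}$ for joint actuation, $\boldsymbol{\lambda}$ for contact forces, and $\mathbf{J}_C$ for the contact Jacobian.

```latex
\mathbf{M}(\mathbf{q})\,\ddot{\mathbf{q}}
  + \mathbf{C}(\mathbf{q}, \dot{\mathbf{q}})
  + \mathbf{g}(\mathbf{q})
  = \boldsymbol{\tau} + \mathbf{J}_C^{\top}\boldsymbol{\lambda}
```

A training loss in this spirit would penalize the mismatch between the two sides of the equation when they are evaluated on the predicted motion and the predicted forces.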

Paper Structure (22 sections, 31 equations, 8 figures, 7 tables)

Introduction
Related Work
Proposed Method
Preliminary
Physics-aware Pretrained Transformer
Transformer Encoder-Decoder Backbone
Physics-based Body Representation
Continuous Contact Force Model
Physics-inspired Training Losses
Experiment
Comparison with State-of-the-Arts (SOTAs)
Ablation Study
Improvements to Human Action Recognition
Conclusion
Acknowledgement
...and 7 more sections

Figures (8)

Figure 1: Method Overview. The proposed framework consists of a kinematics-based motion estimation model (orange) and a physics-aware pre-trained Transformer (green) for estimating human dynamics from a monocular video. Inset (a) illustrates the joint actuation of the right pelvis and the contact forces at each foot. Inset (b) illustrates the reconstructed body motion and inferred forces, with lighter colors representing greater joint actuation magnitudes (e.g. upper body joints when the figure is standing, and leg joints when it is walking).
Figure 2: Phys-SMPL. Besides 3D positions, Phys-SMPL models the volume ($V$), mass ($m$), and inertia ($\mathbf{I}$) of every body part. Lighter colors represent larger body weight distributions.
Figure 3: Continuous Contact Force Model. The contact forces received by a point $p$ at time frame $t$ are determined by its velocity and distance to the ground through a spring-mass system built along the horizontal ($k_{h,t}$) and normal ($k_{n,t}$) directions.
Figure 4: Qualitative Evaluation on Utilizing PhysPT. The body color of each figure represents the reconstruction at different time frames (lighter colors indicate later time frames). Ground penetration and motion jittering in the reconstructed motion are marked by a red circle and a red rectangle, respectively.
Figure 5: Qualitative Evaluation with Force Estimation Visualization. The testing image frames are from Human3.6M (left), 3DOH (middle), and PennAction (right).
...and 3 more figures
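The spring-based contact model described in the Figure 3 caption can be illustrated with a generic spring-damper sketch. This is not the paper's formulation: the function `contact_force`, the sigmoid gate, the fixed stiffness and damping values, and the anchor point are all assumptions made for illustration. The caption's $k_{h,t}$ and $k_{n,t}$ carry a time index, whereas this sketch uses constants.

```python
import numpy as np


def contact_force(position, velocity, anchor_xy,
                  k_h=500.0, k_n=2000.0, damping=50.0, sharpness=50.0):
    """Generic spring-damper contact force on one point (z up, ground at z = 0).

    position, velocity: length-3 arrays; anchor_xy: length-2 array giving the
    horizontal rest position of the spring. All parameter values are hypothetical.
    """
    height = position[2]
    # Smooth gate in (0, 1): close to 1 at or below the ground, close to 0 above it.
    gate = 1.0 / (1.0 + np.exp(sharpness * height))
    spring = np.array([
        -k_h * (position[0] - anchor_xy[0]),  # horizontal spring pulls toward the anchor
        -k_h * (position[1] - anchor_xy[1]),
        -k_n * height,                        # normal spring pushes up when height < 0
    ])
    damper = -damping * velocity              # opposes the point's velocity
    return gate * (spring + damper)


# Usage: a foot point 1 cm below the ground, sliding along +x.
force = contact_force(np.array([0.0, 0.0, -0.01]),
                      np.array([1.0, 0.0, 0.0]),
                      np.array([0.0, 0.0]))
print(force)
```

The smooth gate keeps the force continuous in the point's height, so the model has no hard contact/no-contact switch and remains usable inside gradient-based training.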
