Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos
Cuong Le, Viktor Johansson, Manon Kok, Bastian Wandt
TL;DR
OSDCap tackles monocular 3D human motion capture by online fusion of video-derived kinematics with a differentiable physics simulator, using a neural Kalman-filter-style optimal-state estimator. The framework learns a Kalman gain, meta-PD controller gains, and an inertia-bias term to produce physically plausible pose trajectories while estimating external forces and torques. Across Human3.6M, Fit3D, and SportsPose, it achieves state-of-the-art performance among online methods on key metrics such as MPJPE and PCK and demonstrates robustness to noisy observations. The approach advances practical, real-time physics-based motion capture with interpretable dynamics, and suggests further work on detailed hand/foot modeling and personalized body shapes.
Abstract
Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g., in the form of jittery motion, and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in the form of internal forces and exterior torques, helps alleviate these artifacts. Current state-of-the-art approaches use an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e., the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance. To this end, we propose a novel method that selectively incorporates the physics models with the kinematics observations in an online setting, inspired by a neural Kalman-filtering approach. We develop a control loop as a meta-PD controller to predict internal joint torques and external reaction forces, followed by a physics-based motion simulation. A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and the simulated motion, resulting in an optimal-state dynamics prediction. We show that this filtering step is crucial in providing online supervision that helps balance the shortcomings of the respective input motions, and is thus important not only for capturing accurate global motion trajectories but also for producing physically plausible human poses. The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predictive dynamics, compared to the state of the art. The code is available at https://github.com/cuongle1206/OSDCap
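
To make the control, simulation, and filtering loop described above more concrete, the following is a toy sketch for a single joint angle. It is not the authors' implementation: the function names, the constant inertia, the fixed PD gains, and the constant scalar gain are all assumptions chosen for illustration. In the proposed method the Kalman gain and the PD gains are learned, and the simulation runs on a predefined skeleton rather than one joint.

```python
def pd_torque(q, qd, q_target, kp, kd):
    """PD control law: pull the joint toward the target angle and damp its velocity."""
    return kp * (q_target - q) - kd * qd


def simulate_step(q, qd, tau, inertia, dt):
    """Semi-implicit Euler step for one joint with constant inertia (assumed)."""
    qdd = tau / inertia          # angular acceleration from the applied torque
    qd_next = qd + dt * qdd      # update velocity first
    q_next = q + dt * qd_next    # then position, using the new velocity
    return q_next, qd_next


def fuse(q_sim, q_obs, gain):
    """Kalman-style update: gain=0 trusts the simulation, gain=1 trusts the observation."""
    return q_sim + gain * (q_obs - q_sim)


def track(observations, kp=200.0, kd=20.0, inertia=1.0, dt=1.0 / 50.0, gain=0.3):
    """Run the control -> simulate -> fuse loop over a sequence of observed angles.

    All default values are hypothetical; a learned model would predict
    kp, kd, and gain per time step instead of keeping them constant.
    """
    q, qd = observations[0], 0.0
    fused = []
    for q_obs in observations:
        tau = pd_torque(q, qd, q_obs, kp, kd)               # torque toward the kinematic input
        q_sim, qd = simulate_step(q, qd, tau, inertia, dt)  # physics-based prediction
        q = fuse(q_sim, q_obs, gain)                        # balance simulation and observation
        fused.append(q)
    return fused
```

Calling `track` on a list of noisy per-frame joint angles returns one fused angle per frame. A gain closer to 0 follows the simulated motion more closely, while a gain closer to 1 follows the kinematic observations.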
