Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos

Cuong Le, Viktor Johansson, Manon Kok, Bastian Wandt

TL;DR

OSDCap tackles monocular 3D human motion capture by fusing video-derived kinematics online with a differentiable physics simulator, using a neural Kalman-filter-style optimal-state estimator. The framework learns a Kalman gain, meta-PD controller gains, and an inertia-bias term to produce physically plausible pose trajectories while estimating external forces and torques. Across Human3.6M, Fit3D, and SportsPose, it achieves state-of-the-art performance among online methods on key metrics such as MPJPE and PCK and is robust to noisy observations. The approach advances practical, real-time physics-based motion capture with interpretable dynamics and points to further work on detailed hand/foot modeling and personalized body shapes.

Abstract

Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g., in the form of jittery motion, and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in the form of internal forces and exterior torques, helps alleviate these artifacts. Current state-of-the-art approaches make use of an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e., the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance. To this end, we propose a novel method to selectively incorporate the physics models with the kinematics observations in an online setting, inspired by a neural Kalman-filtering approach. We develop a control loop as a meta-PD controller to predict internal joint torques and external reaction forces, followed by a physics-based motion simulation. A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and simulated motion, resulting in an optimal-state dynamics prediction. We show that this filtering step is crucial to provide online supervision that helps balance the shortcomings of the respective input motions, and is thus important not only for capturing accurate global motion trajectories but also for producing physically plausible human poses. The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predictive dynamics, compared to the state of the art. The code is available at https://github.com/cuongle1206/OSDCap
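
The fusion idea described in the abstract can be illustrated with a minimal single-joint sketch. It assumes a fixed scalar gain in place of the learned Kalman gain, hand-picked PD gains in place of the network-predicted ones, and a unit inertia with no gravity or contact. The ordering of the steps is one plausible arrangement, and all names (`pd_torque`, `kalman_blend`, `step`, `track`) and numeric values are hypothetical; none of them are taken from the OSDCap code base.

```python
def pd_torque(q_target, q, qd, kp, kd):
    """Proportional-derivative torque pulling angle q toward q_target."""
    return kp * (q_target - q) - kd * qd


def kalman_blend(q_sim, q_kin, gain):
    """Blend the simulated angle with the observed one.

    gain = 0 keeps the simulation, gain = 1 adopts the observation.
    (Assumption: a fixed scalar stands in for the learned gain matrix.)
    """
    return q_sim + gain * (q_kin - q_sim)


def step(q_sim, qd, q_kin, gain, kp=60.0, kd=8.0, inertia=1.0, dt=0.01):
    """One online update for a single joint (hypothetical toy model)."""
    q_opt = kalman_blend(q_sim, q_kin, gain)    # filtered ("optimal") pose
    tau = pd_torque(q_kin, q_opt, qd, kp, kd)   # PD torque toward the observation
    qdd = tau / inertia                         # forward dynamics for 1 DoF
    qd_next = qd + dt * qdd                     # semi-implicit Euler: velocity first
    q_next = q_opt + dt * qd_next               # then position, from the filtered pose
    return q_next, qd_next


def track(observations, gain, q0=0.0):
    """Run the update over a sequence of observed joint angles."""
    q, qd = q0, 0.0
    trajectory = []
    for q_kin in observations:
        q, qd = step(q, qd, q_kin, gain)
        trajectory.append(q)
    return trajectory
```

Calling `track([1.0, 1.2, 0.8, 1.1], gain=0.1)` returns one filtered angle per observation. A small gain leans on the simulated state, which is the behaviour the paper reports when the kinematics input is unreliable; a gain near 1 follows the observations closely.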

Paper Structure

This paper contains 25 sections, 14 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: OSDCap performs optimal-state dynamics estimation (cyan) based on two streams of input motion: a kinematics-based pose estimation from videos (top-left) and a physics-based simulation by a meta-PD controller (bottom-left). The predicted motion is physically plausible and contains less high-frequency noise, while retaining a highly accurate global position.
  • Figure 2: The main pipeline of OSDCap. Our approach consists of one neural network model, OSDNet (orange), and three processing components. OSDNet takes the current system state and estimates a Kalman gain matrix, PD gains, an external force and an inertia-bias matrix. The optimal pose estimation block contains a Kalman filter for the current system state and the input kinematics. Yellow refers to the algorithm's state vectors and cyan denotes processing operations. The physics priors block (gray) computes the inertia matrix and non-linear forces using the composite rigid-body algorithm and inverse dynamics [featherstone_2008_rbd]. Using the PD algorithm and forward dynamics (Eq. \ref{eq:eom}), the physics simulation block (green) updates the velocity based on the computed optimal pose and physics priors.
  • Figure 3: Qualitative results of OSDCap (cyan) compared to the kinematics input [sun_2023_trace] (purple), with the corresponding ground-truth pose (red). Left: Filtering results of OSDCap on a sample from SportsPose [ingwersen_2023_sportspose], where the kinematics estimation is very inaccurate along the camera's depth dimension. The Kalman gain along the y-axis (optical axis) is greatly decreased due to the incorrect translation of the kinematics input. Therefore, the simulated state is preferred. Right: Example from Fit3D [fieraru_2021_fit3d], with an unnaturally leaning pose caused by depth ambiguities. Unlike Fig. \ref{fig:qualitative-1}, the three poses are manually separated for better visualization. OSDCap recovers the physically plausible upright pose.
  • Figure 4: Architecture of the proposed OSDNet. The network consists of three hidden layers of size 512 that generate the system state's embedding. Based on the state embedding, the inertia-bias matrix $\mathbf{M_{\text{base}}}^{b}_t$, PD gains $\boldsymbol{\kappa}_{P}, \boldsymbol{\kappa}_{D}$, Jacobian matrix $\mathbf{J}_t$, contact probability $\boldsymbol{\rho}^c_t$ and external force $\boldsymbol{\lambda}_t$ are estimated. The proposed GRU unit of size 128 takes the dynamics features (mentioned in Sec. \ref{subsubsec:OSD}) as input, and the Kalman gain matrix $\mathbf{K}_t$ is estimated from the concatenation of the GRU output and the state embedding. The hidden state $h_{\text{gru}}$ is continuously updated at each time step. For a better estimation of foot-ground contacts and reaction forces, we also feed the feet positions and linear velocities as additional inputs.
  • Figure 5: The simulated proxy character used in the paper. The RBDL library [felis_2017_rbdl] is used to extract the inertia matrix $\boldsymbol{M}_t$ and bias forces $h(q,\dot{q})$.
  • ...and 2 more figures
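
The quantities named in the captions above fit together in the standard rigid-body equation of motion. The following is a hedged sketch rather than the paper's own Eq. \ref{eq:eom}, which is not reproduced in this excerpt. The torque symbol $\boldsymbol{\tau}_t$, the PD target $\hat{q}_t$, the element-wise product $\odot$, and the time step $\Delta t$ are assumptions introduced for illustration, and the learned inertia-bias matrix is omitted because the captions do not state how it enters.

$$
\boldsymbol{M}_t\,\ddot{q}_t + h(q_t,\dot{q}_t) = \boldsymbol{\tau}_t + \mathbf{J}_t^{\top}\boldsymbol{\lambda}_t,
\qquad
\boldsymbol{\tau}_t = \boldsymbol{\kappa}_{P} \odot (\hat{q}_t - q_t) - \boldsymbol{\kappa}_{D} \odot \dot{q}_t
$$

Solving for the acceleration and integrating once gives the velocity update attributed to the physics simulation block:

$$
\ddot{q}_t = \boldsymbol{M}_t^{-1}\left(\boldsymbol{\tau}_t + \mathbf{J}_t^{\top}\boldsymbol{\lambda}_t - h(q_t,\dot{q}_t)\right),
\qquad
\dot{q}_{t+1} = \dot{q}_t + \Delta t\,\ddot{q}_t
$$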