MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency

Mingyi Shi, Kfir Aberman, Andreas Aristidou, Taku Komura, Dani Lischinski, Daniel Cohen-Or, Baoquan Chen

TL;DR

MotioNet tackles monocular 3D human motion reconstruction by predicting a single, consistent skeleton (bone lengths ${\bf s}$) and a dynamic sequence of 3D joint rotations ${\bf q}$ with global root motions ${\bf r}$ and foot contacts ${\bf f}$, all fed into a differentiable forward-kinematics layer. The architecture comprises two encoders, $E_S$ for the static skeleton and $E_Q$ for rotations/root/contacts, plus a rotation-velocity adversarial loss ${\mathcal L}_{Q_{GAN}}$ and complementary losses for skeleton, joint positions, and foot contacts, enabling end-to-end learning of a complete motion representation without external IK. Key contributions include learning joint rotations directly from data within a unified FK-based pipeline, enforcing a single skeleton across time to preserve bone-length consistency, and robust performance under occlusion through noise- and confidence-based inputs and motion-space training. Experiments on CMU Mocap and Human3.6M show competitive joint-position accuracy while delivering smoother, more realistic, temporally coherent rotations suitable for animation pipelines, with online reconstruction showing potential for real-time deployment. Limitations include handling only a single character and the absence of explicit physical-interaction constraints, suggesting future work on multi-person scenes and physics-informed motion modeling.

Abstract

We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video. While previous methods rely on either rigging or inverse kinematics (IK) to associate a consistent skeleton with temporally coherent joint rotations, our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used motion representation. At the crux of our approach lies a deep neural network with embedded kinematic priors, which decomposes sequences of 2D joint positions into two separate attributes: a single, symmetric skeleton, encoded by bone lengths, and a sequence of 3D joint rotations associated with global root positions and foot contact labels. These attributes are fed into an integrated forward kinematics (FK) layer that outputs 3D positions, which are compared to a ground truth. In addition, an adversarial loss is applied to the velocities of the recovered rotations, to ensure that they lie on the manifold of natural joint rotations. The key advantage of our approach is that it learns to infer natural joint rotations directly from the training data, rather than assuming an underlying model or inferring them from joint positions using a data-agnostic IK solver. We show that enforcing a single consistent skeleton along with temporally coherent joint rotations constrains the solution space, leading to more robust handling of self-occlusions and depth ambiguities.
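To make the role of the FK layer concrete, the following is a minimal sketch of forward kinematics on a T-pose skeleton, written in plain Python rather than as a differentiable network layer. The function names, the `(w, x, y, z)` quaternion convention, and the three-joint toy chain are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q (computes q * v * q^-1)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), conj)
    return (rx, ry, rz)


def forward_kinematics(parents, offsets, local_rots, root_pos):
    """Compute global 3D joint positions for one frame.

    parents[j]    : parent index of joint j (-1 for the root); parents come first.
    offsets[j]    : T-pose offset of joint j from its parent (direction * bone length).
    local_rots[j] : unit quaternion of joint j relative to its parent.
    root_pos      : global position of the root joint.
    """
    n = len(parents)
    global_rots = [None] * n
    positions = [None] * n
    for j in range(n):
        p = parents[j]
        if p < 0:
            global_rots[j] = local_rots[j]
            positions[j] = tuple(root_pos)
        else:
            # A joint is placed by rotating its T-pose offset with the
            # accumulated rotation of its parent, then adding the parent position.
            off = quat_rotate(global_rots[p], offsets[j])
            positions[j] = tuple(a + b for a, b in zip(positions[p], off))
            global_rots[j] = quat_mul(global_rots[p], local_rots[j])
    return positions


# Toy chain (hypothetical): hip -> knee -> ankle, hanging down the -y axis.
parents = [-1, 0, 1]
offsets = [(0.0, 0.0, 0.0), (0.0, -0.5, 0.0), (0.0, -0.4, 0.0)]
identity = (1.0, 0.0, 0.0, 0.0)
half = math.pi / 4  # half of a 90-degree rotation about the z axis
hip_rot = (math.cos(half), 0.0, 0.0, math.sin(half))
positions = forward_kinematics(parents, offsets, [hip_rot, identity, identity], (0.0, 0.0, 0.0))
```

Because the bone lengths enter only through `offsets`, reusing the same offsets for every frame keeps the skeleton consistent over time, while the per-frame rotations and root position carry all of the motion.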

Paper Structure

This paper contains 34 sections, 13 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Joint rotation ambiguity. Given a set of fixed 3D joint positions, multiple limb rotations can connect every pair of consecutive joints. Thus, recovered 3D joint positions alone are not sufficient for driving a rigged and skinned virtual 3D character.
  • Figure 2: Our framework receives 2D joint positions along with per-joint confidence values, which are simulated based on empirical experiments from real videos. It extracts per-frame joint rotations and global root positions along with foot contact labels, and a static (duration-independent) skeleton, using two encoders, $E_Q$ and $E_S$. The extracted rotations are fed into a discriminator $D$ that is trained to tune the temporal differences of the rotation angles to mimic the distribution of natural rotations, using adversarial training. In addition, the rotations and the static feature, which is converted to a "T-pose", are fed into the forward kinematics layer, $FK$, which computes 3D joint positions that are compared to a ground truth.
  • Figure 3: Our network applies forward kinematics on a "T-pose" skeleton by successively rotating the limbs from the root to the end-effectors.
  • Figure 4: Our network contains two encoders: $E_Q$, which generates a temporal set of joint rotations, global positions and foot contact labels using parallel convolutions, and $E_S$, which outputs a static attribute that represents the skeleton, using an adaptive pooling layer that collapses the temporal axis.
  • Figure 5: Modeling the distribution of joint confidence values (continuous red line) using empirical distribution of confidence values (bins), extracted from videos in the wild.
  • ...and 6 more figures
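
Two operations named in the captions above, the temporal differences passed to the discriminator (Figure 2) and the pooling that collapses the temporal axis in $E_S$ (Figure 4), can be sketched in a few lines of plain Python. The helper names are hypothetical, max pooling is an assumed choice of pooling operator, and differencing raw rotation parameters frame to frame is a simplification rather than the paper's exact formulation.

```python
def temporal_differences(frames):
    """Finite differences between consecutive frames.

    frames: list of T per-frame vectors (e.g. flattened rotation parameters).
    Returns T - 1 difference vectors, a simple stand-in for rotation velocities.
    """
    return [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]


def collapse_time(features):
    """Max-pool each channel over the time axis: (T, C) -> (C,).

    The output size depends only on the channel count C, so clips of any
    duration map to a feature of the same size.
    """
    return [max(channel) for channel in zip(*features)]


# Toy inputs (hypothetical values): 3 frames with 2 channels each.
rotations = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.0]]
velocities = temporal_differences(rotations)
static_feature = collapse_time(rotations)
```

Pooling over time is what makes the skeleton attribute duration-independent: a 3-frame clip and a 300-frame clip both yield one vector of length C.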