Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment
Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu
TL;DR
This work reconstructs 3D human motion from monocular video by learning a cross-modal latent manifold that aligns 3D motion with 2D visual inputs. The authors introduce the Video-to-Motion Generator (VTM), which couples two components: a two-part motion auto-encoder (TPMAE) that learns motion priors independent of skeleton scale, and a two-part visual encoder (TPVE) that maps video frames and 2D keypoints into the same latent space. A scale-invariant virtual skeleton $\bar{s}$ and bone-ratio prediction are used to recover the full motion. A manifold alignment loss and joint training enable reconstruction of complete motion (root translations, joint rotations, and a coherent skeleton). VTM achieves competitive or state-of-the-art results on AIST++ while running at about 70 fps, and it generalizes to unseen viewpoints and in-the-wild footage. The work suggests that cross-modal latent alignment is an effective paradigm for monocular motion capture and points toward unsupervised or semi-supervised extensions that could exploit large-scale video data without paired mocap.
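To make the idea of aligning two modalities in one latent space concrete, the following is a minimal sketch of an alignment loss over paired latent codes. It is an illustration under stated assumptions, not the authors' formulation: the mean squared L2 distance, the function names `alignment_loss` and `two_part_alignment_loss`, and the `"upper"`/`"lower"` dictionary keys are all hypothetical.

```python
def alignment_loss(z_motion, z_visual):
    """Mean squared L2 distance between paired latent codes.

    Both arguments are lists of equal-length vectors (one per sample).
    Assumption: the distance measure is chosen for illustration only.
    """
    if len(z_motion) != len(z_visual) or not z_motion:
        raise ValueError("need the same non-zero number of paired latents")
    total = 0.0
    for zm, zv in zip(z_motion, z_visual):
        if len(zm) != len(zv):
            raise ValueError("paired latents must have the same dimension")
        # Squared distance between the motion latent and the visual latent.
        total += sum((a - b) ** 2 for a, b in zip(zm, zv))
    return total / len(z_motion)


def two_part_alignment_loss(motion_latents, visual_latents):
    """Sum the alignment loss over hypothetical 'upper' and 'lower' body parts."""
    return sum(
        alignment_loss(motion_latents[part], visual_latents[part])
        for part in ("upper", "lower")
    )


motion = {"upper": [[1.0, 1.0, 1.0]], "lower": [[0.5, 0.5]]}
visual = {"upper": [[0.0, 0.0, 0.0]], "lower": [[0.5, 0.5]]}
loss = two_part_alignment_loss(motion, visual)
```

In this toy input the lower-body latents already coincide, so only the upper-body term contributes to the loss. Minimizing such a term pulls the visual encoder's outputs toward the motion auto-encoder's latent codes.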
Abstract
Learning 3D human motion from 2D inputs is a fundamental task in computer vision and computer graphics. Many previous methods address this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches have difficulty defining the complete configuration of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations with the motion priors. Evaluated on AIST++, VTM achieves state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, VTM generalizes to unseen view angles and in-the-wild videos.
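As an illustration of why a scale-invariant skeleton representation removes body-size variation, the sketch below normalizes bone lengths by their sum, so that two skeletons differing only in overall size map to the same description. This is a hypothetical construction for intuition only: the abstract does not define the virtual skeleton or the bone ratios at this level of detail, and the helper names, the toy three-joint chain, and the sum normalization are assumptions.

```python
import math


def bone_lengths(joints, parents):
    """Length of each bone (joint to its parent); joints with parent -1 are roots."""
    return [
        math.dist(joints[j], joints[p])
        for j, p in enumerate(parents)
        if p >= 0
    ]


def normalized_bone_lengths(joints, parents):
    """Bone lengths divided by their sum, so uniform scaling has no effect."""
    lengths = bone_lengths(joints, parents)
    total = sum(lengths)
    if total == 0.0:
        raise ValueError("degenerate skeleton: all bones have zero length")
    return [length / total for length in lengths]


# Toy chain of three joints: root -> joint 1 -> joint 2.
parents = [-1, 0, 1]
small = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 3.0, 0.0)]
# The same skeleton, uniformly scaled by a factor of 2.
large = [tuple(2.0 * c for c in joint) for joint in small]

ratios_small = normalized_bone_lengths(small, parents)
ratios_large = normalized_bone_lengths(large, parents)
```

Both skeletons yield the same normalized description, which is the property a motion prior needs if it is to be shared across subjects of different sizes.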
