Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

TL;DR

This work tackles reconstructing 3D human motion from monocular video by learning a cross-modal latent manifold that aligns 3D motion with 2D visual inputs. The authors introduce the Video-to-Motion Generator (VTM), which couples a two-part motion auto-encoder (TPMAE) with a two-part visual encoder (TPVE). The TPMAE learns motion priors that are independent of skeleton scale, and the TPVE maps video frames and 2D keypoints into the same latent space. A scale-invariant virtual skeleton $\bar{s}$ and bone-ratio prediction are used to recover the full motion. A manifold alignment loss and joint training enable accurate reconstruction of complete motion (root translations, joint rotations, and a coherent skeleton), achieving competitive or state-of-the-art results on AIST++ while running at ~70 fps and generalizing to unseen viewpoints and in-the-wild footage. The work suggests that cross-modal latent alignment is a promising approach to monocular motion capture, and it points toward unsupervised or semi-supervised extensions that could exploit large-scale video data without paired mocap.

Abstract

Learning 3D human motion from 2D inputs is a fundamental task in computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or in training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations with the motion priors. Evaluated on AIST++, the VTM achieves state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM generalizes to unseen view angles and in-the-wild videos.
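
The cross-modal alignment idea described above can be illustrated with a minimal sketch. It assumes that each body part (upper, lower) has one latent vector from the motion encoder and one from the visual encoder, and that alignment is measured with a plain mean squared error summed over the two parts. The names `mse`, `alignment_loss`, `motion_z` and `visual_z` are hypothetical, the latent values are made up, and the paper's actual loss may be defined differently.

```python
from typing import Dict, List

Latent = List[float]


def mse(a: Latent, b: Latent) -> float:
    """Mean squared difference between two equal-length latent vectors."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def alignment_loss(motion_z: Dict[str, Latent], visual_z: Dict[str, Latent]) -> float:
    """Sum of per-part distances between motion and visual latents.

    Assumption: a plain MSE per body part; this is an illustration,
    not the loss defined in the paper.
    """
    return sum(mse(motion_z[part], visual_z[part]) for part in ("upper", "lower"))


# Hypothetical latent codes for one clip (values invented for illustration).
motion_z = {"upper": [0.0, 1.0], "lower": [2.0, 2.0]}
visual_z = {"upper": [0.0, 3.0], "lower": [2.0, 2.0]}

# Only the upper-body latents differ here, so only that part contributes.
loss = alignment_loss(motion_z, visual_z)
```

Under this assumption, driving the loss toward zero pulls the visual latents onto the manifold learned from motion data, so the motion decoders can be reused on visual features.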

Paper Structure (16 sections, 13 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Our VTM can reconstruct 3D human motion from a wide range of
  • Figure 2: System overview of our VTM. The TPMAE (including motion encoders $\mathcal{E}_u^M$ and $\mathcal{E}_l^M$, motion decoders $\mathcal{D}_u$ and $\mathcal{D}_l$, and root decoder $\mathcal{D}_r$) is first trained on the motion data to learn two latent manifolds as the motion priors. Then, the TPVE (including 2D keypoint feature extractors $\mathcal{K}_u$ and $\mathcal{K}_l$, visual fusion blocks $\mathcal{F}_u$ and $\mathcal{F}_l$, visual encoders $\mathcal{E}_u^V$ and $\mathcal{E}_l^V$, and bone ratio predictor $\mathcal{E}^B$) is jointly trained with the pre-trained TPMAE to align the visual features with the motion priors for reconstructing 3D human motion. The superscripts $M$ and $V$ stand for "Motion" and "Video"; the subscripts $u$, $l$ and $r$ denote the "upper body part", "lower body part" and "root".
  • Figure 3: Qualitative comparisons to other SOTA methods. The green skeletons represent the ground truth poses, and the red ones represent the poses reconstructed by the different methods.
  • Figure 4: VTM can generalize to different view angles. The first row shows video frames from different camera settings, and the second row shows the same pose viewed from the angles corresponding to those camera settings. Only videos and 2D keypoints from camera setting 1, namely c1, are used for training our VTM.
  • Figure 5: VTM can reconstruct motion from in-the-wild videos. The first row shows consecutive frames from an in-the-wild video, and the second row shows the corresponding results produced by VTM.