ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos

Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li

TL;DR

ARTS is a novel semi-Analytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos. It surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks.

Abstract

Although existing video-based 3D human mesh recovery methods have made significant progress, simultaneously estimating human pose and shape from low-resolution image features limits their performance. These image features lack sufficient spatial information about the human body and contain various sources of noise (e.g., background, lighting, and clothing), which often results in inaccurate pose and inconsistent motion. Inspired by the rapid advance in human pose estimation, we observe that, compared to image features, skeletons inherently contain accurate human pose and motion. Therefore, we propose a novel semi-Analytical Regressor using disenTangled Skeletal representations for human mesh recovery from videos, called ARTS. Specifically, a skeleton estimation and disentanglement module is proposed to estimate the 3D skeletons from a video and decouple them into disentangled skeletal representations (i.e., joint position, bone length, and human motion). Then, to fully utilize these representations, we introduce a semi-analytical regressor to estimate the parameters of the human mesh model. The regressor consists of three modules: Temporal Inverse Kinematics (TIK), Bone-guided Shape Fitting (BSF), and Motion-Centric Refinement (MCR). TIK utilizes joint position to estimate initial pose parameters, and BSF leverages bone length to regress bone-aligned shape parameters. Finally, MCR combines the human motion representation with image features to refine the initial human model parameters. Extensive experiments demonstrate that ARTS surpasses existing state-of-the-art video-based methods in both per-frame accuracy and temporal consistency on popular benchmarks: 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/TangTao-PKU/ARTS.
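
To make the disentanglement step concrete, the following is a minimal NumPy sketch of how a 3D skeleton sequence could be split into joint positions, bone lengths, and motion. It is an illustration under stated assumptions, not the released ARTS code: the 5-joint kinematic tree `PARENTS`, the function name `disentangle_skeletons`, the root-relative joint convention, and the use of frame-to-frame differences as "motion" are all hypothetical choices made here for clarity.

```python
import numpy as np

# Hypothetical 5-joint kinematic tree: PARENTS[j] is the parent of joint j,
# and -1 marks the root. A real skeleton (e.g. 17 or 24 joints) has its own tree.
PARENTS = np.array([-1, 0, 1, 0, 3])


def disentangle_skeletons(skeletons, parents=PARENTS):
    """Split a (T, J, 3) skeleton sequence into three representations.

    Returns:
        joints:       (T, J, 3)   root-relative joint positions
        bone_lengths: (T, J-1)    length of each child-to-parent bone
        motion:       (T-1, J, 3) frame-to-frame joint displacement
    """
    skeletons = np.asarray(skeletons, dtype=float)

    # Joint position: express every joint relative to the root (joint 0).
    joints = skeletons - skeletons[:, :1, :]

    # Bone length: Euclidean distance between each non-root joint and its parent.
    children = np.where(parents >= 0)[0]
    bone_vectors = skeletons[:, children] - skeletons[:, parents[children]]
    bone_lengths = np.linalg.norm(bone_vectors, axis=-1)

    # Motion (assumed here to be a first-order temporal difference).
    motion = np.diff(skeletons, axis=0)

    return joints, bone_lengths, motion
```

In this sketch, bone lengths are invariant to global translation and to pose changes that preserve the skeleton's proportions, which is why they are a natural input for shape fitting, while the motion term carries only temporal change.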

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison between the previous video-based HMR methods and our ARTS. (a) Previous video-based HMR methods estimate the human pose and shape from low-resolution image features. (b) Our ARTS effectively utilizes disentangled skeletal representations (i.e., Motions, Joints, Bones) with image features to estimate and refine the human pose and shape.
  • Figure 2: Overview of the proposed ARTS. Given a video sequence, ResNet is utilized to extract the image features $F$ of each frame. We estimate the 3D skeletons and decouple them into joints, motions, and bones. Then, in the semi-analytical SMPL regressor, Temporal Inverse Kinematics (TIK) obtains initial SMPL pose parameters $\theta_{init}$ from joints and image features. Bone-guided Shape Fitting (BSF) gets bone-aligned SMPL shape parameters $\beta_{init}$ from bones. Moreover, we utilize motions to guide the fusion of image features and use motion-centric features $F'$ to refine SMPL parameters. Finally, ARTS feeds the refined SMPL parameters $\theta_{refined}$, $\beta_{refined}$ to the SMPL regressor to generate the human mesh.
  • Figure 3: Illustration of the bone-guided shape fitting. Analytics and Analytical-MLP are utilized to map bone length into the initial SMPL shape parameters.
  • Figure 4: Ablation study for different sequence lengths $T$ in terms of MPJPE (left) and Accel (right) on the 3DPW dataset.
  • Figure 5: Qualitative comparison among GLoT (green mesh), PMCE (pink mesh) and our ARTS (blue mesh) on the challenging 3DPW dataset.
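
The two metrics named in the Figure 4 caption can be sketched as follows. This is a minimal illustration of the commonly used definitions of MPJPE (per-frame accuracy) and acceleration error (temporal consistency); the function names are hypothetical, and the exact evaluation protocol used for ARTS (joint set, root alignment, units) is not given in this section and is not assumed here.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Per Joint Position Error for (T, J, 3) joint sequences.

    Averages the Euclidean distance between predicted and ground-truth
    joints over all frames and joints. Inputs are assumed to be already
    aligned (e.g. root-centred) and in the same units.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=-1).mean()


def accel_error(pred, gt):
    """Acceleration error for (T, J, 3) joint sequences, with T >= 3.

    Acceleration is approximated by the second-order finite difference
    x[t-1] - 2 * x[t] + x[t+1]; the error is the mean norm of the
    difference between predicted and ground-truth accelerations.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    acc_pred = pred[:-2] - 2.0 * pred[1:-1] + pred[2:]
    acc_gt = gt[:-2] - 2.0 * gt[1:-1] + gt[2:]
    return np.linalg.norm(acc_pred - acc_gt, axis=-1).mean()
```

A constant offset between prediction and ground truth raises MPJPE but leaves the acceleration error at zero, whereas single-frame jitter has a comparatively large effect on the acceleration error. This is why the two metrics are reported together.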