SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad

TL;DR

SkelFormer targets markerless multi-view 3D human pose and body shape estimation by decoupling 3D keypoint detection from inverse kinematics (IK). It combines a 3D keypoint estimator (DLT triangulation of 2D detections) with a skeletal transformer that maps noisy joint positions to SMPL pose and shape, aided by a synthetic-aligned joint regressor and extensive data augmentations. The method shows strong in-distribution performance and competitive out-of-distribution results, handles occlusions and sensor noise robustly, and runs significantly faster than traditional optimization-based IK. Ablations highlight the importance of joint-aware attention, symmetric orthogonalization, and augmentation strategies for generalization. Overall, SkelFormer advances practical, fast, and robust markerless motion capture across diverse environments and datasets.
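
The DLT triangulation step mentioned above can be illustrated with a minimal NumPy sketch. This is a generic Direct Linear Transform, not the authors' implementation; the function name `triangulate_dlt` and the optional per-view `weights` argument (for example, 2D detector confidences) are assumptions made for illustration.

```python
import numpy as np

def triangulate_dlt(projections, points_2d, weights=None):
    """Triangulate one 3D point from N >= 2 calibrated views (generic DLT).

    projections: (N, 3, 4) camera projection matrices P = K [R | t]
    points_2d:   (N, 2) pixel coordinates of the same joint in each view
    weights:     optional (N,) per-view weights (hypothetical, e.g. detector scores)
    """
    projections = np.asarray(projections, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    if weights is None:
        weights = np.ones(projections.shape[0])
    weights = np.asarray(weights, dtype=float)

    # Each view adds two linear constraints on the homogeneous point X:
    #   u * (P[2] . X) - (P[0] . X) = 0
    #   v * (P[2] . X) - (P[1] . X) = 0
    rows = []
    for P, (u, v), w in zip(projections, points_2d, weights):
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)

    # Least-squares solution of A X = 0: the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

In a multi-view setup this would be called once per joint, with the 2D detections of that joint gathered across all cameras.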

Abstract

We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation. Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions. Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations. This module integrates prior knowledge about pose space and infers the full pose state at runtime. Separating the 3D keypoint detection and inverse-kinematic problems, along with the expressive representations learned by our skeletal transformer, enhances the generalization of our method to unseen noisy data. We evaluate our method on three public datasets in both in-distribution and out-of-distribution settings and observe strong performance with respect to prior works. Moreover, ablation experiments demonstrate the impact of each module of our architecture. Finally, we study the performance of our method in dealing with noise and heavy occlusions and find considerable robustness with respect to other solutions.
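
A regression-based IK model has to turn unconstrained network outputs into valid joint rotations. The symmetric orthogonalization named in the TL;DR commonly refers to projecting a raw 3x3 matrix onto the nearest rotation matrix via SVD. The sketch below shows that generic projection under this assumption; the function name and the batched layout are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def symmetric_orthogonalization(raw):
    """Project raw 9-D outputs onto SO(3) using SVD.

    raw: array of shape (..., 9) or (..., 3, 3); returns (B, 3, 3) rotations,
    where B is the flattened batch size.
    """
    m = np.asarray(raw, dtype=float).reshape(-1, 3, 3)
    u, _, vt = np.linalg.svd(m)

    # U @ Vt is orthogonal but may be a reflection (det = -1).
    # Flipping the last singular direction gives the closest proper rotation.
    det = np.linalg.det(u @ vt)
    d = np.ones((m.shape[0], 3))
    d[:, 2] = np.where(det < 0, -1.0, 1.0)

    # Equivalent to U @ diag(d) @ Vt for every batch element.
    return (u * d[:, None, :]) @ vt
```

For SMPL, such a projection would be applied per joint, so a batch of 24 raw 9-D vectors yields 24 rotation matrices.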

Paper Structure (23 sections, 3 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: An overview of the proposed skeletal transformer pipeline. During training, noisy 3D keypoints are generated using our joint regressor, while during inference, 3D keypoints are provided by off-the-shelf models. Our proposed skeletal transformer then maps the keypoints onto the SMPL pose and shape parameters.
  • Figure 2: Detailed architectures of our joint encoder, pose decoder, and shape decoder modules are presented.
  • Figure 3: A visual comparison with the pseudo-ground-truth from Moon et al. [moon2022neuralannot], showing the realism and accuracy of SkelFormer.
  • Figure 4: The fitting performance of VPoser while using (a) our proposed joint regressor; and (b) the joint regressor from Moon et al. [moon2022neuralannot].
  • Figure 5: Robustness of our skeletal transformer is highlighted in the presence of different levels of noise and occlusion by comparing it against VPoser and VPoser-t.
  • ...and 3 more figures
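
The noise and occlusion conditions referred to in Figure 5 can be simulated on 3D keypoints with a simple corruption routine. The sketch below is only an illustration of the idea: the function name, the Gaussian noise model, the per-joint drop probability, and the default values are all assumptions, not the settings used in the paper.

```python
import numpy as np

def corrupt_keypoints(joints, noise_std=0.02, drop_prob=0.2, rng=None):
    """Add Gaussian noise to 3D joints and randomly mark joints as occluded.

    joints:    (J, 3) joint positions (units assumed to be meters)
    noise_std: standard deviation of the additive noise (hypothetical default)
    drop_prob: probability that a joint is treated as occluded (hypothetical)
    Returns (noisy_joints, visible_mask).
    """
    rng = np.random.default_rng() if rng is None else rng
    joints = np.asarray(joints, dtype=float)

    # Sensor-like jitter on every coordinate.
    noisy = joints + rng.normal(0.0, noise_std, size=joints.shape)

    # Occlusion: dropped joints are zeroed and flagged in the mask so a
    # downstream model can ignore them.
    visible = rng.random(joints.shape[0]) >= drop_prob
    noisy[~visible] = 0.0
    return noisy, visible
```

Sweeping `noise_std` and `drop_prob` over a range of values would give a robustness curve of the kind the figure caption describes.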