D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation

Nikolaos Vasilikopoulos, Drosakis Drosakis, Antonis Argyros

TL;DR

The paper tackles monocular 3D human pose and shape estimation from a single RGB image, a task made difficult by depth ambiguities. It introduces D-PoSE, a one-stage CNN-based method that predicts human depth maps and body-part segmentation as intermediate representations before regressing SMPL-X pose and shape, and that is trained purely on the synthetic BEDLAM and AGORA datasets. D-PoSE achieves state-of-the-art results on the real-world benchmarks 3DPW and EMDB while using only 81.2M parameters, about 8.4 times fewer than transformer-based rivals such as TokenHMR, which uses ~681M. The approach generalizes well from synthetic data to real-world scenarios, and ablations confirm that the depth intermediate representation improves accuracy. It offers a simple, efficient foundation for future extensions, including temporal modeling and larger backbones.

Abstract

We present D-PoSE (Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation), a one-stage method that estimates human pose and SMPL-X shape parameters from a single RGB image. Recent works use larger models with transformer backbones and decoders to improve accuracy on human pose and shape (HPS) benchmarks. D-PoSE proposes a vision-based approach that uses estimated human depth-maps as an intermediate representation for HPS. It is trained on synthetic data and uses the ground-truth depth-maps provided with that data for depth supervision. Although trained on synthetic datasets, D-PoSE achieves state-of-the-art performance on the real-world benchmark datasets EMDB and 3DPW. Despite its simple, lightweight design and CNN backbone, it outperforms ViT-based models with almost an order of magnitude more parameters. D-PoSE code is available at: https://github.com/nvasilik/D-PoSE

Paper Structure

This paper contains 20 sections, 15 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: D-PoSE, the proposed 3D human pose and shape estimation method, receives a single RGB image as input (left) and produces intermediate depth and part-segmentation representations (middle, bottom and top, respectively) from which it delivers the 3D pose and shape of the imaged person. Despite using a small fraction of the parameters of current models, D-PoSE outperforms the current state of the art in 3D pose and shape estimation accuracy on the major relevant datasets (3DPW, EMDB).
  • Figure 2: The architecture of D-PoSE. Given an input image, features are extracted using a CNN. With these feature maps a human depth map and a part-segmentation map are estimated. The original features pass through a soft-attention mechanism which uses part-segmentation maps. The final features are concatenated with the bounding-box information and the depth features and are given as input to the regressor which estimates the 3D human pose and shape.
  • Figure 3: Left: Image sampled from 3DPW, Right: human depth-map estimated by our method.
  • Figure 4: Left: Ground-truth depth map visualized in grayscale (BEDLAM dataset). Right: Ground-truth SMPL-X mesh rendered with part segmentation (BEDLAM dataset).
  • Figure 5: Each image block shows the input image (left), the part-segmentation estimate as an intermediate representation (middle-top), the human depth map as an intermediate representation (middle-bottom), and the 3D HPS estimate of our method (right). The figure illustrates results on the 3DPW dataset (top left block), the EMDB test set (top right), a synthetic image sampled from the BEDLAM validation set (bottom left), and the RICH dataset (bottom right).
  • ...and 1 more figure
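
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, tensor sizes, and bounding-box layout are hypothetical, and the soft attention is assumed to be a spatial softmax over each part-segmentation map followed by weighted pooling of the backbone features. The actual D-PoSE layers may differ.

```python
import numpy as np

def spatial_softmax(logits):
    """Softmax over the last axis (flattened spatial positions)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def part_attention_pool(features, part_logits):
    """Pool backbone features with one attention map per body part.

    features:    (C, H, W) backbone feature maps
    part_logits: (P, H, W) part-segmentation logits (assumed form)
    returns:     (P, C) one pooled feature vector per part
    """
    c, h, w = features.shape
    p = part_logits.shape[0]
    attn = spatial_softmax(part_logits.reshape(p, h * w))  # (P, H*W)
    return attn @ features.reshape(c, h * w).T             # (P, C)

def build_regressor_input(features, part_logits, depth_features, bbox_info):
    """Concatenate attended features, depth features, and bounding-box info."""
    pooled = part_attention_pool(features, part_logits)
    return np.concatenate([pooled.ravel(), depth_features.ravel(), bbox_info.ravel()])

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, P, H, W = 8, 4, 6, 6
features = rng.standard_normal((C, H, W))
part_logits = rng.standard_normal((P, H, W))
depth_features = rng.standard_normal(16)   # stand-in for depth-branch features
bbox_info = np.array([0.1, -0.2, 0.5])     # stand-in for bounding-box information

x = build_regressor_input(features, part_logits, depth_features, bbox_info)
print(x.shape)  # (51,) = P*C + 16 + 3
```

In this sketch the vector `x` is what a regressor would consume to predict SMPL-X pose and shape parameters. The regressor itself is omitted because the caption does not specify its structure.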