BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation

Shengze Wang; Jiefeng Li; Tianye Li; Ye Yuan; Henry Fuchs; Koki Nagano; Shalini De Mello; Michael Stengel

TL;DR

BLADE tackles single-view 3D human mesh recovery under perspective distortion by decoupling the Z-translation $T_z$ from the other parameters. It first estimates $T_z$ from a cropped image with a pelvis-depth predictor, then performs $T_z$-conditioned SMPL-X pose and shape estimation, and finally recovers the focal length and XY-translation through differentiable rasterization. A large-scale synthetic dataset, Bedlam-cc, is introduced to cover challenging close-range depths and enable robust $T_z$ estimation. Across SPEC-MTP, PDHuman, HuMMan, and Bedlam-cc, BLADE achieves state-of-the-art performance in depth, camera parameters, 3D pose, and 2D alignment, particularly for close-range imagery. This perspective-aware framework improves the accuracy and reliability of single-image 3D human pose estimation and of data labeling for real-world applications.
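To make the last stage concrete, the sketch below shows why the focal length and XY-translation become well determined once $T_z$ and the 3D body are known. It is not BLADE's procedure: the summary above says BLADE uses differentiable rasterization, whereas this example swaps in a closed-form least-squares fit on point correspondences under a plain pinhole model. The function names `project` and `solve_focal_and_xy`, the principal point `c`, and the synthetic points are all assumptions made for illustration.

```python
import numpy as np


def project(points, f, t, c=(0.0, 0.0)):
    """Pinhole projection of Nx3 points after translating them by t = (Tx, Ty, Tz)."""
    p = points + np.asarray(t, dtype=float)
    return f * p[:, :2] / p[:, 2:3] + np.asarray(c, dtype=float)


def solve_focal_and_xy(points, keypoints_2d, tz, c=(0.0, 0.0)):
    """Least-squares fit of f, Tx, Ty with Tz held fixed (illustrative only).

    With d = Z + Tz, the projection u - cx = f * X / d + (f * Tx) / d is
    linear in the unknowns (f, f * Tx, f * Ty), so one linear solve suffices.
    """
    d = points[:, 2] + tz
    uv = keypoints_2d - np.asarray(c, dtype=float)
    n = len(points)
    A = np.zeros((2 * n, 3))
    A[:n, 0] = points[:, 0] / d   # coefficient of f in the u equations
    A[:n, 1] = 1.0 / d            # coefficient of f * Tx
    A[n:, 0] = points[:, 1] / d   # coefficient of f in the v equations
    A[n:, 2] = 1.0 / d            # coefficient of f * Ty
    b = np.concatenate([uv[:, 0], uv[:, 1]])
    (f, ftx, fty), *_ = np.linalg.lstsq(A, b, rcond=None)
    return f, ftx / f, fty / f
```

The fit needs points at varying depth; if every point shared the same $Z$, the focal length could still be read off from the image scale, but only because $T_z$ is given. Without a known $T_z$, focal length and depth trade off against each other, which is the ambiguity the decoupling is meant to remove.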

Abstract

Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneous body shape, pose, and camera estimation. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve accurate 3D pose and accurate 2D alignment at the same time. The error comes mainly from inaccurate perspective projection that is heuristically derived from orthographic parameters. To resolve this long-standing challenge, we present BLADE, which accurately recovers perspective parameters from a single image without heuristic assumptions. We start from the inverse relationship between perspective distortion and the person's Z-translation $T_z$, and we show that $T_z$ can be reliably estimated from the image. We then discuss the important role of $T_z$ in accurate human mesh recovery from close-range images. Finally, we show that, once $T_z$ and the 3D human mesh are estimated, one can accurately recover the focal length and the full 3D translation. Extensive experiments on standard benchmarks and real-world close-range images show that our method is the first to accurately recover projection parameters from a single image, and that it consequently attains state-of-the-art accuracy on 3D pose estimation and 2D alignment for a wide range of images. Project page: https://research.nvidia.com/labs/amri/projects/blade/
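The inverse relationship between perspective distortion and $T_z$ can be illustrated with a small pinhole-camera calculation. This is a sketch under assumed values, not code from the paper: the helper name `magnification_ratio` and the 0.3 m offset are hypothetical. Under a pinhole model, a body part that sits `dz` metres closer to the camera than the pelvis is magnified by $T_z / (T_z - dz)$ relative to the pelvis, so the excess magnification is roughly $dz / T_z$ for large $T_z$.

```python
def magnification_ratio(tz, dz):
    """Image-scale ratio of a body part dz metres in front of the pelvis.

    tz is the pelvis depth (Z-translation) in metres. Under a pinhole model
    the image scale at depth z is proportional to 1 / z, so the ratio is
    (1 / (tz - dz)) / (1 / tz) = tz / (tz - dz).
    """
    if tz <= dz:
        raise ValueError("the body part must lie in front of the camera")
    return tz / (tz - dz)


if __name__ == "__main__":
    # Hypothetical case: a hand held 0.3 m in front of the pelvis.
    for tz in (0.5, 1.0, 2.0, 5.0, 10.0):
        ratio = magnification_ratio(tz, dz=0.3)
        print(f"Tz = {tz:5.1f} m -> hand appears {ratio:.3f}x larger than at pelvis depth")
```

At large $T_z$ the ratio approaches 1, which is why an orthographic (weak-perspective) approximation is adequate for distant subjects and fails at close range.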
