BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation

Shengze Wang, Jiefeng Li, Tianye Li, Ye Yuan, Henry Fuchs, Koki Nagano, Shalini De Mello, Michael Stengel

TL;DR

BLADE tackles single-view 3D human mesh recovery under perspective distortion by decoupling the Z-translation $T_z$ from other parameters. It first estimates $T_z$ from a cropped image using a pelvis-depth predictor, then performs $T_z$-conditioned SMPL-X pose/shape estimation, and finally recovers focal length and XY-translation through differentiable rasterization. A large-scale Bedlam-cc synthetic dataset is introduced to cover challenging close-range depths, enabling robust $T_z$ estimation. Across SPEC-MTP, PDHuman, HuMMaN, and Bedlam-cc, BLADE achieves state-of-the-art performance in depth, camera parameters, 3D pose, and 2D alignment, particularly for close-range imagery. This perspective-aware framework advances the accuracy and reliability of single-image 3D human pose estimation and data labeling for real-world applications.

Abstract

Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneous body shape, pose, and camera estimation. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve both accurate 3D pose and 2D alignment at the same time. Error is mainly introduced by inaccurate perspective projection heuristically derived from orthographic parameters. To resolve this long-standing challenge, we present our method BLADE, which accurately recovers perspective parameters from a single image without heuristic assumptions. We start from the inverse relationship between perspective distortion and the person's Z-translation $T_z$, and we show that $T_z$ can be reliably estimated from the image. We then discuss the important role of $T_z$ for accurate human mesh recovery from close-range images. Finally, we show that, once $T_z$ and the 3D human mesh are estimated, one can accurately recover the focal length and full 3D translation. Extensive experiments on standard benchmarks and real-world close-range images show that our method is the first to accurately recover projection parameters from a single image, and consequently attains state-of-the-art accuracy on 3D pose estimation and 2D alignment for a wide range of images. Project page: https://research.nvidia.com/labs/amri/projects/blade/
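The inverse relationship between perspective distortion and $T_z$ can be illustrated with a plain pinhole projection. The sketch below is not taken from the paper: the function names, the 0.3 m depth offsets, and the 0.2 m segment width are hypothetical values chosen for illustration. It compares the image-space size of a body part slightly in front of the pelvis with an equal-sized part slightly behind it, and shows that the ratio depends on $T_z$ but not on the focal length.

```python
def project(point, f, t_z):
    """Pinhole projection of a body-frame point (x, y, z) whose origin
    (the pelvis) sits t_z metres in front of the camera."""
    x, y, z = point
    depth = z + t_z
    return (f * x / depth, f * y / depth)


def distortion_ratio(f, t_z, near_offset=-0.3, far_offset=0.3, half_width=0.2):
    """Image-space size of a near segment divided by that of an equal-sized
    far segment. A value of 1.0 would mean no perspective distortion.
    The offsets and width are illustrative assumptions."""
    near = project((half_width, 0.0, near_offset), f, t_z)[0]
    far = project((half_width, 0.0, far_offset), f, t_z)[0]
    return near / far


# The ratio simplifies to (t_z + 0.3) / (t_z - 0.3): f cancels out.
for t_z in (0.6, 1.0, 5.0):
    for f in (500.0, 2000.0):
        print(f"T_z={t_z} m, f={f} px, ratio={distortion_ratio(f, t_z):.3f}")
```

Under these assumptions the ratio is about 3 at $T_z = 0.6$ m and about 1.13 at $T_z = 5$ m for either focal length, which is why the focal length alone cannot explain the distortion seen in close-range images.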

Paper Structure

This paper contains 26 sections, 13 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Our method enables accurate human mesh and camera parameter estimation for single-view in-the-wild images including close-ups with high levels of perspective distortion (pelvis depth $T_z$ shown in meters).
  • Figure 2: Pose error introduced by camera heuristics. (1,2) Previous methods estimate the pose of the person from image crops, leading to pose inaccuracy compared to the ground truth (left). (3) Focal length and 3D translation $(f,T)$ are heuristically converted from a 2D affine transformation $(s,t_x,t_y)$, which is only suitable from afar but not for close-range images. (4) Due to the incorrect pose and perspective parameters, the final estimation is inaccurate.
  • Figure 3: Influence of $T_z$ on perspective distortion. A person is captured with different focal lengths and Z-translations $T_z$ from the camera. (b & d) Changing the focal length from a short lens $f_1$ to a long lens $f_2$ changes the zoom factor but does not change the perspective distortion, as shown by the equivalence between (c) and (d). (a) Changing the Z-translation by $\Delta T_z$ changes the level of perspective distortion in the image. This effect is particularly pronounced for close-range imagery (blue curve). See Sec. \ref{sec:preliminary} for a detailed discussion.
  • Figure 4: Overview. Starting with a bounding box image crop $I_{crop}$ of the person, the Pelvis Depth Estimator $F^{T_z}$ (green box) estimates the Z-translation of the person's pelvis, $T_z$. Then, the Pose Estimator $F^{pose}$ (blue box) estimates SMPL-X shape and pose ($\beta$, $\theta$) from the full input image while considering the image distortion induced by $T_z$. Finally, through differentiable rasterization, the Camera Solver (brown box) recovers the optimal focal length and 3D translation that best align the rasterized SMPL-X mesh with the segmented mask of the person. We are thus able to solve for the full perspective projection model without heuristic assumptions.
  • Figure 5: Solving for $(f,T_x,T_y)$. (a) Starting from the initial $(f,T_x,T_y)=(h,0,0)$, the estimated $T_z$, and the human mesh parameters $(\beta, \theta)$, (b) the optimal $(f,T_x,T_y,T_z)$ is derived by optimizing the image-space alignment through differentiable rasterization \cite{laine2020modular}. (c) The optimized parameters correctly align the projected 3D human mesh to the person in the image.
  • ...and 12 more figures
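
The camera-solving step described in the Figure 4 and Figure 5 captions can be illustrated with a much simpler stand-in. The paper optimizes mask alignment through differentiable rasterization; the sketch below does not do that. It swaps in linear least squares on hypothetical 2D-3D point correspondences, only to show why a known $T_z$ makes $(f, T_x, T_y)$ recoverable: with $T_z$ fixed, the projection $u = f\,(X + T_x)/(Z + T_z) + c_x$ is linear in $f$ and $f\,T_x$. All function names, the sample points, and the principal point are assumptions made for illustration.

```python
def project(points, f, t, c=(0.0, 0.0)):
    """Perspective projection of 3D points with focal length f,
    translation t = (T_x, T_y, T_z), and principal point c."""
    tx, ty, tz = t
    return [(f * (x + tx) / (z + tz) + c[0], f * (y + ty) / (z + tz) + c[1])
            for x, y, z in points]


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    x = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        s = m[i][3] - sum(m[i][k] * x[k] for k in range(i + 1, 3))
        x[i] = s / m[i][i]
    return x


def fit_camera(points3d, points2d, t_z, c=(0.0, 0.0)):
    """Least-squares estimate of (f, T_x, T_y) given a known T_z.
    Unknowns are (f, f*T_x, f*T_y); each correspondence gives two linear rows."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (x, y, z), (u, v) in zip(points3d, points2d):
        w = 1.0 / (z + t_z)
        rows = (([x * w, w, 0.0], u - c[0]), ([y * w, 0.0, w], v - c[1]))
        for row, rhs in rows:
            for i in range(3):
                atb[i] += row[i] * rhs
                for j in range(3):
                    ata[i][j] += row[i] * row[j]
    f, bx, by = solve3(ata, atb)
    return f, bx / f, by / f


# Synthetic check: hypothetical pelvis-centred points and a close-range camera.
points3d = [(0.0, 0.0, 0.0), (0.2, 0.5, 0.1), (-0.2, 0.5, -0.1),
            (0.1, -0.8, 0.05), (-0.1, -0.8, -0.05)]
true_f, true_t = 1200.0, (0.05, -0.1, 0.8)
points2d = project(points3d, true_f, true_t)
f_hat, tx_hat, ty_hat = fit_camera(points3d, points2d, t_z=true_t[2])
print(f_hat, tx_hat, ty_hat)
```

On noise-free synthetic data the fit returns the focal length and XY-translation used to generate the projections. The mask-based objective in the paper replaces point correspondences with silhouette alignment, which this sketch does not attempt to reproduce.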