Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery

Jerrin Bright, Bavesh Balaji, Harish Prakash, Yuhao Chen, David A Clausi, John Zelek

TL;DR

This paper introduces distribution and depth-aware human mesh recovery (D2A-HMR), an end-to-end transformer architecture designed to minimize the disparity between distributions and to incorporate scene depth by leveraging prior depth information.

Abstract

Precise Human Mesh Recovery (HMR) with in-the-wild data is a formidable challenge and is often hindered by depth ambiguities and reduced precision. Existing works resort to either pose priors or multi-modal data such as multi-view or point cloud information, though these methods often overlook the valuable scene-depth information inherently present in a single image. Moreover, achieving robust HMR for out-of-distribution (OOD) data is exceedingly challenging due to inherent variations in pose, shape and depth. Consequently, understanding the underlying distribution becomes a vital subproblem in modeling human forms. Motivated by the need for unambiguous and robust human modeling, we introduce Distribution and depth-aware human mesh recovery (D2A-HMR), an end-to-end transformer architecture designed to minimize the disparity between distributions and to incorporate scene depth by leveraging prior depth information. Our approach demonstrates superior performance in handling OOD data in certain scenarios while consistently achieving competitive results against state-of-the-art HMR methods on controlled datasets.

Paper Structure

This paper contains 11 sections, 5 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Illustration of our main idea. (a) Overview of the proposed D2A-HMR approach. (b) Our method, D2A-HMR, improves the mesh-image alignment (particularly as visualized in the highlighted region) when compared against SPIN [spin], PARE [pare], and METRO [metro].
  • Figure 2: D2A-HMR model architecture. Given an image ($\textbf{I}$), a transformer backbone ($\textbf{E}$) estimates the depth map ($\textbf{D}$) and a CNN backbone ($\textbf{F}$) extracts features from the image. Positional embeddings are applied to both image and pseudo-depth features, using a hybrid approach for the image tokens ($z_{img}$) and pseudo-depth tokens ($z_{depth}$). Self-attention is performed on $z_{img}$ and $z_{depth}$, resulting in $z_{img}'$ and $z_{depth}'$, respectively. Subsequently, cross-attention is applied between $z_{img}'$ and $z_{depth}'$ to produce $z_c$. The learnable fusion gates combine $z_{img}'$, $z_{depth}'$, and $z_c$, followed by layer normalization and an MLP. The resulting gated tokens ($z$) are fed into three distinct refinement modules: a decoder ($\textbf{D}$) for silhouette estimation; a regressor head ($\textbf{R}$), which incorporates a normalizing flow ($\textbf{DM}$) for distribution-aware joint vertex estimation; and masked modeling for enhanced semantic representation of the features.
  • Figure 3: Qualitative results. Inferred SMPL mesh reconstruction on the MLBPitchDB baseball dataset [mitigatingblur].
  • Figure 4: Qualitative results. Qualitative comparison of D2A-HMR with SPIN [spin], PARE [pare], METRO [metro], ROMP [romp], and PyMAF [pymaf] on in-the-wild data from different sports datasets [mitigatingblur, icehockey, lsp] and unusual poses from the internet.
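
The token fusion described in the Figure 2 caption (self-attention on each stream, cross-attention between them, learnable gates, then layer normalization and an MLP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The caption does not specify the gate form, the attention direction, or the layer sizes, so everything below is an assumption made for illustration: single-head attention, image tokens as queries and depth tokens as keys/values in the cross-attention, a per-token softmax gate over the three streams, equal token counts and widths for both streams, and random weights standing in for learned ones. The names `fuse_tokens`, `init_params`, and `w_gate` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_in, kv_in, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    # Passing the same array as q_in and kv_in gives self-attention.
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def layer_norm(x, eps=1e-5):
    # Layer normalization without learned scale/shift, for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def init_params(d, hidden, rng):
    # Random stand-ins for learned weights (hypothetical layout).
    def qkv():
        return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

    return {
        "self_img": qkv(),
        "self_depth": qkv(),
        "cross": qkv(),
        "w_gate": rng.normal(scale=(3 * d) ** -0.5, size=(3 * d, 3)),
        "w1": rng.normal(scale=d ** -0.5, size=(d, hidden)),
        "w2": rng.normal(scale=hidden ** -0.5, size=(hidden, d)),
    }


def fuse_tokens(z_img, z_depth, params):
    # z_img, z_depth: (N, d) token arrays; assumed to share N and d.
    z_img_p = attention(z_img, z_img, *params["self_img"])          # z_img'
    z_depth_p = attention(z_depth, z_depth, *params["self_depth"])  # z_depth'
    # Assumption: image tokens query the depth tokens.
    z_c = attention(z_img_p, z_depth_p, *params["cross"])

    # Assumed gate form: one softmax weight per stream, per token.
    streams = np.stack([z_img_p, z_depth_p, z_c], axis=0)           # (3, N, d)
    concat = np.concatenate([z_img_p, z_depth_p, z_c], axis=-1)     # (N, 3d)
    gates = softmax(concat @ params["w_gate"], axis=-1)             # (N, 3)
    fused = np.einsum("ns,snd->nd", gates, streams)                 # (N, d)

    # Layer normalization followed by a two-layer ReLU MLP.
    hidden = np.maximum(layer_norm(fused) @ params["w1"], 0.0)
    return hidden @ params["w2"], gates


rng = np.random.default_rng(0)
params = init_params(d=16, hidden=32, rng=rng)
z, gates = fuse_tokens(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)), params)
print(z.shape, gates.shape)
```

A softmax gate is only one plausible choice here: it keeps the three stream weights non-negative and summing to one per token, which makes the fused token a convex combination of $z_{img}'$, $z_{depth}'$, and $z_c$. Independent sigmoid gates per stream would be an equally reasonable reading of "learnable fusion gates".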