Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation

István Sárándi, Gerard Pons-Moll

TL;DR

The paper tackles the challenge of 3D human pose and shape estimation from single RGB images when training data come with heterogeneous annotations. It introduces Neural Localizer Fields (NLF), a neural field that assigns a localizer function to every point in a canonical body volume, enabling on-demand 3D localization of any point by modulating a lightweight heatmap head. It also provides an efficient post-processing method to fit SMPL-family models to nonparametric predictions, producing compact parametric representations suitable for downstream tasks. Across benchmarks such as 3DPW, EMDB, EHF, SSP-3D, and AGORA, NLF achieves state-of-the-art performance, robust generalization, and real-time inference, while enabling large-scale, mixed-dataset training without re-annotation.
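
The core idea above, a field that turns a canonical body point into the parameters of a heatmap head, can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration rather than the authors' implementation: the toy sizes, the two-layer field MLP (`W1`, `W2`), the function names `localizer_weights`, `soft_argmax_3d` and `localize`, and the use of a plain soft-argmax over a small volumetric heatmap.

```python
import numpy as np

rng = np.random.default_rng(0)

C, D, H, W = 8, 4, 6, 6   # feature channels, heatmap depth/height/width (toy sizes)
HIDDEN = 16               # hidden width of the field MLP (hypothetical)

# Field MLP parameters: canonical point (3,) -> 1x1-conv weights (C*D) plus biases (D)
W1 = rng.normal(0, 0.5, (3, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.5, (HIDDEN, C * D + D))
b2 = np.zeros(C * D + D)


def localizer_weights(p):
    """Map a canonical-space query point p (3,) to 1x1-conv weights (C, D) and biases (D,)."""
    hidden = np.tanh(p @ W1 + b1)
    out = hidden @ W2 + b2
    return out[: C * D].reshape(C, D), out[C * D:]


def soft_argmax_3d(logits):
    """Expected (x, y, z) coordinate of a (D, H, W) heatmap, normalized to [0, 1]^3."""
    prob = np.exp(logits - logits.max())
    prob /= prob.sum()
    z = (prob.sum(axis=(1, 2)) * np.linspace(0, 1, D)).sum()
    y = (prob.sum(axis=(0, 2)) * np.linspace(0, 1, H)).sum()
    x = (prob.sum(axis=(0, 1)) * np.linspace(0, 1, W)).sum()
    return np.array([x, y, z])


def localize(features, p):
    """features: (H, W, C) image feature map; p: (3,) canonical query point."""
    weights, bias = localizer_weights(p)
    logits = features @ weights + bias   # (H, W, D): a 1x1 conv is a per-pixel matmul
    return soft_argmax_3d(np.transpose(logits, (2, 0, 1)))


features = rng.normal(size=(H, W, C))      # stand-in for backbone features
query = np.array([0.6, 0.1, 0.0])          # hypothetical canonical coordinates
print(localize(features, query))
```

The same feature map can be queried with any number of points. Only the small field network is re-evaluated per point, which is what makes on-demand localization cheap in this sketch.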

Abstract

With the explosive growth of available training data, single-image 3D human modeling is ahead of a transition to a data-centric paradigm. A key to successfully exploiting data scale is to design flexible models that can be supervised from various heterogeneous data sources produced by different researchers or vendors. To this end, we propose a simple yet powerful paradigm for seamlessly unifying different human pose and shape-related tasks and datasets. Our formulation is centered on the ability -- both at training and test time -- to query any arbitrary point of the human volume, and obtain its estimated location in 3D. We achieve this by learning a continuous neural field of body point localizer functions, each of which is a differently parameterized 3D heatmap-based convolutional point localizer (detector). For generating parametric output, we propose an efficient post-processing step for fitting SMPL-family body models to nonparametric joint and vertex predictions. With this approach, we can naturally exploit differently annotated data sources including mesh, 2D/3D skeleton and dense pose, without having to convert between them, and thereby train large-scale 3D human mesh and skeleton estimation models that considerably outperform the state-of-the-art on several public benchmarks including 3DPW, EMDB, EHF, SSP-3D and AGORA.
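
The abstract does not spell out the fitting procedure. As a hedged illustration of one standard building block for fitting a body model to predicted points, the sketch below rigidly aligns a template point set to predictions with the Kabsch algorithm. The function name `kabsch` and the idea of applying it to one body part's template vertices are assumptions, not a description of the paper's algorithm.

```python
import numpy as np


def kabsch(source, target):
    """Return rotation R (3, 3) and translation t (3,) minimizing sum ||R @ s_i + t - t_i||^2.

    source, target: (N, 3) arrays of corresponding points.
    """
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    cross_cov = (source - src_mean).T @ (target - tgt_mean)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(cross_cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t


# Usage: align hypothetical template points of one body part to predicted points
rng = np.random.default_rng(0)
template_part = rng.normal(size=(50, 3))          # stand-in for template vertices
angle = 0.5
true_R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
predicted_part = template_part @ true_R.T + np.array([0.2, 0.0, 1.5])
R, t = kabsch(template_part, predicted_part)
print(np.abs(template_part @ R.T + t - predicted_part).max())
```

A closed-form step like this involves no iterative optimizer, which is one reason fits to high-quality nonparametric predictions can be fast.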

Paper Structure

This paper contains 52 sections, 8 equations, 11 figures, 12 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Can one model learn to localize any point of the human body in 3D from a single RGB image? We propose to build a generalist human pose and shape estimator that can readily learn from any annotated points at training time and can estimate any user-chosen points at test time.
  • Figure 2: Overview of NLF. Given image features $\mathcal{F}$ and any arbitrarily chosen 3D query point $\mathbf{p}$ within the canonical human volume, we aim to estimate the observation-space 3D point $\mathbf{p}^\prime$. To control which point gets estimated, we dynamically modulate a convolutional layer at the output, to produce heatmaps for the requested point. We achieve this modulation by predicting the convolutional weights through a neural field. During training, the points $\mathbf{p}$ can be picked per training example based on whichever points are annotated for it, allowing natural dataset mixing. At test time, the model can flexibly estimate any surface point and any skeletal joint inside the body volume, as required.
  • Figure 3: Qualitative result on SSP-3D. Left: NLF's nonparametric output (front and side view). Right: result of our proposed fast SMPL fitting algorithm (front and side view). Our nonparametric prediction already has high quality, allowing us to use a simple and efficient fitting algorithm to obtain body model parameters that faithfully represent the nonparametric output.
  • Figure 4: Customizable point localization. By selecting points $\mathbf{p}$ in the continuous canonical space, we can predict any landmark set both at training and test time. The first column depicts the query points we estimate: SMPL(-X) joints and vertices, COCO joints, Human3.6M joints, and arbitrary points sampled within the human volume. The fourth and seventh columns show rotated views.
  • Figure 5: Uncertainty estimation results. High uncertainty is indicated in yellow, low uncertainty in blue. Occluded body parts tend to receive higher predicted uncertainty.
  • ...and 6 more figures
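
The Figure 2 caption notes that query points can be picked per training example according to whichever points are annotated. A minimal sketch of what that could look like in a training loss follows. The names `toy_localizer`, `mixed_batch_loss` and `make_example`, the mean-absolute-error loss, and the point counts are all assumptions for illustration; the toy localizer is a stand-in, not the network.

```python
import numpy as np


def toy_localizer(image_features, canonical_points):
    """Hypothetical stand-in for the network: (F,) features and (N, 3) queries -> (N, 3)."""
    offset = image_features[:3]   # pretend the image only determines a global offset
    return canonical_points + offset


def mixed_batch_loss(batch, localizer):
    """Mean absolute error, averaged over examples that each carry their own query points."""
    per_example = []
    for example in batch:
        pred = localizer(example["features"], example["canonical_points"])
        per_example.append(np.abs(pred - example["targets"]).mean())
    return float(np.mean(per_example))


rng = np.random.default_rng(0)


def make_example(num_points):
    """Build a synthetic example with num_points annotated canonical points."""
    features = rng.normal(size=8)
    points = rng.uniform(-1, 1, size=(num_points, 3))
    targets = toy_localizer(features, points) + 0.01   # synthetic annotations
    return {"features": features, "canonical_points": points, "targets": targets}


# Three examples with differently sized annotation sets, e.g. two skeleton
# conventions and a dense set of surface samples (counts are illustrative)
batch = [make_example(24), make_example(17), make_example(500)]
print(mixed_batch_loss(batch, toy_localizer))
```

Because every annotation is expressed as points in one canonical space, no example has to be converted to another dataset's landmark format before it contributes to the loss.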