ECON: Explicit Clothed humans Optimized via Normal integration

Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, Michael J. Black

TL;DR

ECON addresses the challenge of reconstructing detailed clothed 3D humans from a single image by merging implicit surface flexibility with explicit SMPL-X body regularization. It predicts dense front/back clothing normals $\widehat{\mathcal{N}}^{\text{c}}_\text{F}, \widehat{\mathcal{N}}^{\text{c}}_\text{B}$, lifts them to 2.5D surfaces via depth-aware d-BiNI guided by $\mathcal{M}^\text{b}$, and completes the full geometry with IF-Nets+ conditioned on the body, followed by Poisson stitching. The approach yields high-fidelity clothed 3D humans in challenging poses and with loose clothing, outperforming prior methods on CAPE and Renderpeople in both quantitative metrics (Chamfer, P2S, normals) and perceptual realism, and it supports multi-person reconstruction under occlusions. The work demonstrates strong generalization, robustness to pose/clothing topology, and provides code/models to enable research and downstream applications in avatar creation and metaverse pipelines.
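The Chamfer and P2S metrics named above can be illustrated on point sets. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: both surfaces are represented by sampled points (true P2S measures distance to the surface itself), distances are unsquared and averaged, and the function names `nearest_distances`, `chamfer_distance`, and `p2s_distance` are hypothetical. Conventions for these metrics differ between papers.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (N, 3), the Euclidean distance to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]           # (N, M, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(a, b):
    """Symmetric metric: mean of the two directed average nearest-neighbour distances."""
    return 0.5 * (nearest_distances(a, b).mean() + nearest_distances(b, a).mean())

def p2s_distance(scan_points, recon_points):
    """One-directional metric: ground-truth scan points to the (point-sampled) reconstruction."""
    return nearest_distances(scan_points, recon_points).mean()

# Toy data (assumption): the reconstruction misses the scan point at x = 5.
scan = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
recon = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

print("P2S:", p2s_distance(scan, recon))
print("Chamfer:", chamfer_distance(scan, recon))
```

In this toy case the missing point is penalized by P2S directly, while Chamfer halves the penalty because the reverse direction (reconstruction to scan) has zero error. This is one reason the two metrics are usually reported together.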

Abstract

The combination of deep learning, artist-curated scans, and Implicit Functions (IF) is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry, but produce disembodied limbs or degenerate shapes for novel poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit representation and explicit body regularization. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a "canvas" for stitching together detailed surface patches. Based on these, our method, ECON, has three main steps: (1) It infers detailed 2D normal maps for the front and back side of a clothed person. (2) From these, it recovers 2.5D front and back surfaces, called d-BiNI, that are equally detailed, yet incomplete, and registers these w.r.t. each other with the help of a SMPL-X body mesh recovered from the image. (3) It "inpaints" the missing geometry between d-BiNI surfaces. If the face and hands are noisy, they can optionally be replaced with the ones of SMPL-X. As a result, ECON infers high-fidelity 3D humans even in loose clothes and challenging poses. This goes beyond previous methods, according to the quantitative evaluation on the CAPE and Renderpeople datasets. Perceptual studies also show that ECON's perceived realism is better by a large margin. Code and models are available for research purposes at econ.is.tue.mpg.de.

Paper Structure (20 sections, 11 equations, 19 figures, 4 tables)

Figures (19)

  • Figure 1: Human digitization from a color image. ECON combines the best aspects of free-form implicit representation and explicit anthropomorphic regularization to infer high-fidelity 3D humans, even with loose clothing or in challenging poses. It does so in three steps: (1) It infers detailed 2D normal maps for the front and back side (see the section on detailed normal prediction). (2) The normal maps are converted into detailed, yet incomplete, 2.5D front and back surfaces guided by a SMPL-X estimate (see the section on front and back surface reconstruction). (3) It then "inpaints" the missing geometry between the two surfaces (see the section on human shape completion). Face or hands can be optionally replaced with the cleaner ones from SMPL-X. See https://econ.is.tue.mpg.de for more results.
  • Figure 2: Summary of SOTA. PIFuHD [saito2020pifuhd] recovers clothing details, but struggles with novel poses. ICON [xiu2022icon] and PaMIR [zheng2020pamir] regularize shape to a body shape, but over-constrain the skirts, or over-smooth the wrinkles. ECON combines their best aspects.
  • Figure 3: Overview. ECON takes as input an RGB image, $\bm{\mathcal{I}}$, and a SMPL-X body, $\bm{\mathcal{M}^\text{b}}$. Conditioned on the rendered front and back body normal images, $\bm{\mathcal{N}^{\text{b}}}$, ECON first predicts front and back clothing normal maps, $\bm{\widehat{\mathcal{N}}^{\text{c}}}$. These two maps, along with body depth maps, $\bm{\mathcal{Z}^{\text{b}}}$, are fed into a d-BiNI optimizer to produce front and back surfaces, $\{\bm{\mathcal{M}_\text{F}}, \bm{\mathcal{M}_\text{B}}\}$. Based on these partial surfaces and the body estimate $\bm{\mathcal{M}^\text{b}}$, IF-Nets+ implicitly completes $\bm{\mathcal{R}_\text{IF}}$. With the face or hands optionally taken from $\bm{\mathcal{M}^\text{b}}$, screened Poisson reconstruction combines everything into the final watertight mesh $\bm{\mathcal{R}}$.
  • Figure 4: Four inputs to d-BiNI. $\Omega_{\text{n}}$ and $\Omega_{\text{z}}$ are the domains of clothed and body regions, respectively. $\partial\Omega_{\text{n}}$ is the silhouette of $\Omega_{\text{n}}$.
  • Figure 5: "Inpainting" the missing geometry. We simulate different cases of occlusion by masking the normal images and present the intermediate and final 3D reconstruction of different design choices. While IF-Nets misses certain body parts, IF-Nets+ produces a plausible overall shape. $\text{ECON}_{\text{IF}}$ produces more consistent clothing surfaces than $\text{ECON}_{\text{EX}}$ due to a learned shape distribution.
  • ...and 14 more figures
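
The captions for Figures 3 and 4 describe turning a normal map into a 2.5D surface with help from a body depth map. The sketch below illustrates only that general idea and is not d-BiNI itself: it solves a plain least-squares normal integration with a soft depth prior for a single surface, without the bilateral weighting or the joint front/back optimization. It assumes orthographic projection and the sign convention that the normal is proportional to $(-\partial z/\partial x, -\partial z/\partial y, 1)$. The function name `integrate_normals` and the weight `lam` are hypothetical.

```python
import numpy as np

def integrate_normals(normals, depth_prior, prior_mask, lam=1.0):
    """Recover a depth map z (H, W) from unit normals (H, W, 3).

    Assumptions (not from the paper): orthographic camera, forward differences,
    and a soft prior lam * (z - depth_prior) = 0 on pixels where prior_mask is True.
    """
    H, W, _ = normals.shape
    nz = normals[..., 2]
    p = -normals[..., 0] / nz            # target dz/dx
    q = -normals[..., 1] / nz            # target dz/dy
    idx = np.arange(H * W).reshape(H, W)  # pixel -> unknown index

    rows, rhs = [], []

    def add_row(entries, value):
        row = np.zeros(H * W)
        for k, c in entries:
            row[k] = c
        rows.append(row)
        rhs.append(value)

    for i in range(H):
        for j in range(W):
            if j + 1 < W:   # horizontal gradient constraint
                add_row([(idx[i, j + 1], 1.0), (idx[i, j], -1.0)], p[i, j])
            if i + 1 < H:   # vertical gradient constraint
                add_row([(idx[i + 1, j], 1.0), (idx[i, j], -1.0)], q[i, j])
            if prior_mask[i, j]:  # soft depth prior (plays the role of the body depth)
                add_row([(idx[i, j], lam)], lam * depth_prior[i, j])

    A = np.stack(rows)
    b = np.array(rhs)
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z.reshape(H, W)

# Toy example: a tilted plane z = 0.5 x + 0.2 y with a depth prior at one pixel.
H, W = 6, 8
ys, xs = np.mgrid[0:H, 0:W].astype(float)
z_true = 0.5 * xs + 0.2 * ys
n = np.array([-0.5, -0.2, 1.0])
n /= np.linalg.norm(n)
normals = np.broadcast_to(n, (H, W, 3))
mask = np.zeros((H, W), dtype=bool)
mask[0, 0] = True                        # the prior fixes the unknown depth offset
z_rec = integrate_normals(normals, z_true, mask)
print("max abs error:", np.abs(z_rec - z_true).max())
```

Normals alone determine depth only up to an additive constant, which is why some depth prior is needed. In this sketch a single anchored pixel removes that ambiguity. The dense matrix is only practical for tiny grids; a real implementation would use sparse solvers.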