
EVA3D: Compositional 3D Human Generation from 2D Image Collections

Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, Ziwei Liu

TL;DR

Problem addressed: generating high-quality, animatable 3D humans from sparse 2D image collections is challenging due to articulation and pose/view diversity. The main approach, EVA3D, introduces a compositional NeRF with 16 body-part subnetworks and SMPL-based priors, plus a pose-guided sampling strategy to learn high-resolution 3D humans without 3D supervision. Key contributions include achieving native high-resolution ($512\times256$) 3D human generation from 2D data, an efficient compositional NeRF representation, delta SDF with SMPL guidance, strong quantitative/qualitative results across four fashion datasets, and capabilities for interpolation and inversion. This work advances inverse graphics for scalable, data-efficient 3D human synthesis with potential impact on AR/VR/VFX pipelines and downstream tasks.

Abstract

Inverse graphics aims to recover 3D models from 2D observations. Utilizing differentiable rendering, recent 3D-aware generative models have shown impressive results in rigid object generation using 2D images. However, it remains challenging to generate articulated objects, like human bodies, due to their complexity and diversity in poses and appearances. In this work, we propose EVA3D, an unconditional 3D human generative model learned from 2D image collections only. EVA3D can sample 3D humans with detailed geometry and render high-quality images (up to $512\times256$) without bells and whistles (e.g. super resolution). At the core of EVA3D is a compositional human NeRF representation, which divides the human body into local parts. Each part is represented by an individual volume. This compositional representation enables 1) inherent human priors, 2) adaptive allocation of network parameters, and 3) efficient training and rendering. Moreover, to accommodate the characteristics of sparse 2D human image collections (e.g. imbalanced pose distribution), we propose a pose-guided sampling strategy for better GAN learning. Extensive experiments validate that EVA3D achieves state-of-the-art 3D human generation performance regarding both geometry and texture quality. Notably, EVA3D demonstrates great potential and scalability to "inverse-graphics" diverse human bodies with a clean framework.
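
The compositional idea in the abstract (one volume per local body part) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `inside_box` and `query_compositional`, the use of axis-aligned boxes, and especially the plain averaging where boxes overlap, which may differ from how the paper blends neighbouring parts. The per-part callables stand in for the subnetworks.

```python
import numpy as np

def inside_box(points, box_min, box_max):
    """Boolean mask of points that lie inside one axis-aligned box."""
    return np.all((points >= box_min) & (points <= box_max), axis=-1)

def query_compositional(points, boxes, part_fns, channels):
    """Query each part only for the points inside its box.

    points   : (N, 3) sample locations
    boxes    : list of (box_min, box_max) pairs, one per body part
    part_fns : list of callables mapping (M, 3) -> (M, channels);
               hypothetical stand-ins for per-part subnetworks
    Returns (values, hit): averaged outputs and a mask of covered points.
    """
    n = points.shape[0]
    total = np.zeros((n, channels))
    counts = np.zeros(n)
    for (box_min, box_max), fn in zip(boxes, part_fns):
        mask = inside_box(points, box_min, box_max)
        if mask.any():
            # Only the points inside this part's box reach its function.
            total[mask] += fn(points[mask])
            counts[mask] += 1
    hit = counts > 0
    # Assumption: overlapping parts are combined by a plain average.
    total[hit] /= counts[hit, None]
    return total, hit
```

Points outside every box never reach a part function, which is one way to read the abstract's claim of efficient training and rendering: empty space costs no network evaluation.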

Paper Structure

This paper contains 20 sections, 7 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: EVA3D generates high-quality and diverse 3D humans with photo-realistic RGB renderings and detailed geometry. Only 2D image collections are used for training.
  • Figure 2: Rendering Process of the Compositional Human NeRF Representation. For shape and pose specified by SMPL($\bm{\beta}$, $\bm{\theta}$), local bounding boxes are constructed. Rays that intersect with bounding boxes are sampled and transferred to the canonical space using inverse LBS. Subnetworks corresponding to bounding boxes are queried, results of which are integrated to produce final renderings.
  • Figure 3: 3D Human GAN Framework. With the estimated SMPL and camera parameters distribution $p_{est}$, 3D humans are randomly sampled and rendered conditioned on $z\sim p_{z}$. The renderings are used for adversarial training against real 2D human image collections $p_{real}$.
  • Figure 4: Generation Results of EVA3D. The 3D-aware nature and inherent human prior of EVA3D enable explicit control over rendering views, human poses, and shapes.
  • Figure 5: Qualitative Comparison Between EVA3D and Baseline Methods. Rendered 2D images and corresponding meshes are placed side-by-side. Both the 2D renderings and 3D meshes generated by our method achieve the best quality among SOTA methods. Zoom in for the best view.
  • ...and 10 more figures
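
Two steps named in the Figure 2 caption, testing rays against local bounding boxes and moving samples to canonical space with inverse LBS, can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the boxes are taken to be axis-aligned, and the blend weights and per-joint transforms are passed in directly rather than derived from SMPL($\bm{\beta}$, $\bm{\theta}$).

```python
import numpy as np

def ray_aabb(origin, direction, box_min, box_max, eps=1e-12):
    """Slab test for one ray against one axis-aligned box.

    Returns (hit, t_near, t_far); only the part of the ray with t >= 0 counts.
    """
    t_near, t_far = 0.0, np.inf
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        if abs(d) < eps:
            # Ray is parallel to this slab: it must start between the planes.
            if o < box_min[axis] or o > box_max[axis]:
                return False, 0.0, 0.0
            continue
        t0 = (box_min[axis] - o) / d
        t1 = (box_max[axis] - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False, 0.0, 0.0
    return True, t_near, t_far

def inverse_lbs(points, weights, transforms):
    """Map posed points back to canonical space.

    points     : (N, 3) posed-space samples
    weights    : (N, K) blend weights per point (assumed given)
    transforms : (K, 4, 4) per-joint rigid transforms (assumed given)
    Forward LBS applies the blended matrix sum_k w_k G_k, so the inverse
    solves that blended linear system for each point.
    """
    blended = np.einsum("nk,kij->nij", weights, transforms)  # (N, 4, 4)
    ones = np.ones((points.shape[0], 1))
    homo = np.concatenate([points, ones], axis=1)[..., None]  # (N, 4, 1)
    canonical = np.linalg.solve(blended, homo)[..., 0]
    return canonical[:, :3]
```

A ray that misses every box can be skipped entirely, and samples are needed only on the `[t_near, t_far]` interval of each box it does hit.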