En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

Yifang Men, Biwen Lei, Yuan Yao, Miaomiao Cui, Zhouhui Lian, Xuansong Xie

TL;DR

En3D introduces a zero-shot framework for sculpting high-fidelity 3D human avatars from 2D synthetic data, avoiding reliance on pre-existing 3D or 2D datasets. The approach combines a 3D generative model trained on synthetic, view-balanced data with known camera parameters, a geometry sculptor that refines shape using multi-view guidance, and an explicit texturing module that disentangles textures via semantic UV partitioning and differentiable rendering. Inference integrates optimization modules to sharpen geometry and texture, enabling animation and editing. Empirical results show improved image quality, geometry accuracy, and content diversity, with strong capabilities for avatar animation, editing, and style adaptation.

Abstract

We present En3D, an enhanced generative scheme for sculpting high-quality 3D human avatars. Unlike previous works that rely on scarce 3D datasets or limited 2D collections with imbalanced viewing angles and imprecise pose priors, our approach aims to develop a zero-shot 3D generative scheme capable of producing visually realistic, geometrically accurate and content-wise diverse 3D humans without relying on pre-existing 3D or 2D assets. To address this challenge, we introduce a meticulously crafted workflow that implements accurate physical modeling to learn the enhanced 3D generative model from synthetic 2D data. During inference, we integrate optimization modules to bridge the gap between realistic appearances and coarse 3D shapes. Specifically, En3D comprises three modules: a 3D generator that accurately models generalizable 3D humans with realistic appearance from synthesized balanced, diverse, and structured human images; a geometry sculptor that enhances shape quality using multi-view normal constraints for intricate human anatomy; and a texturing module that disentangles explicit texture maps with fidelity and editability, leveraging semantical UV partitioning and a differentiable rasterizer. Experimental results show that our approach significantly outperforms prior works in terms of image quality, geometry accuracy and content diversity. We also showcase the applicability of our generated avatars for animation and editing, as well as the scalability of our approach for content-style free adaptation.
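The abstract stresses that the synthetic training images are view-balanced and come with known camera parameters. As a rough illustration of what that can mean in practice, the sketch below places cameras at evenly spaced azimuths around a subject at the origin and derives a look-at rotation for each one. The function names, the y-up convention, the default radius, and the OpenGL-style camera axes are all assumptions made for this example; they are not taken from the paper.

```python
import math


def balanced_camera_positions(n_views, radius=2.5, elevation_deg=0.0):
    """Return (azimuth_deg, (x, y, z)) for cameras evenly spaced in azimuth.

    Hypothetical helper: the subject sits at the origin, y points up,
    and azimuth 0 places the camera on the +z axis.
    """
    elev = math.radians(elevation_deg)
    cameras = []
    for i in range(n_views):
        azim = 2.0 * math.pi * i / n_views
        x = radius * math.cos(elev) * math.sin(azim)
        y = radius * math.sin(elev)
        z = radius * math.cos(elev) * math.cos(azim)
        cameras.append((math.degrees(azim), (x, y, z)))
    return cameras


def _normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def look_at_rotation(position, target=(0.0, 0.0, 0.0), up=(0.0, 1.0, 0.0)):
    """Return a 3x3 world-to-camera rotation as three row tuples.

    Assumed convention: the camera looks down its -z axis (OpenGL style).
    Not valid when the viewing direction is parallel to `up`.
    """
    forward = _normalize(tuple(t - p for t, p in zip(target, position)))
    right = _normalize(_cross(forward, up))
    true_up = _cross(right, forward)
    back = tuple(-c for c in forward)
    return (right, true_up, back)


if __name__ == "__main__":
    for azim, pos in balanced_camera_positions(8):
        rot = look_at_rotation(pos)
        print(f"azimuth {azim:6.1f} deg, position {pos}, rotation rows {rot}")
```

Because every azimuth receives the same number of views, a dataset rendered this way has no front-view bias, and the pose attached to each image is exact by construction rather than estimated.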

Paper Structure

This paper contains 13 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Given random noise or guiding text, our generative scheme can synthesize high-fidelity 3D human avatars that are visually realistic and geometrically accurate. These avatars can be seamlessly animated and easily edited. Our model is trained on 2D synthetic data without relying on any pre-existing 3D or 2D collections.
  • Figure 2: An overview of the proposed scheme, which consists of three modules: 3D generative modeling (3DGM), geometric sculpting (GS), and explicit texturing (ET). 3DGM uses synthesized diverse, balanced, and structured human images with accurate camera parameters $\varphi$ to learn generalizable 3D humans with a triplane-based architecture. GS is integrated as an optimization module that uses multi-view normal constraints to refine and carve geometric details. ET uses UV partitioning and a differentiable rasterizer to disentangle explicit UV texture maps. The final results include not only multi-view renderings but also realistic 3D models.
  • Figure 3: A flowchart of our method, which synthesizes textured 3D human avatars from input noise, text, or images.
  • Figure 4: Results of synthesized 3D human avatars at $512^2$ resolution.
  • Figure 5: Qualitative comparison with three state-of-the-art methods: EVA3D [hong2022eva3d], AG3D [dong2023ag3d], and EG3D [chan2022efficient].
  • ...and 3 more figures
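
The Figure 2 caption describes the geometry sculptor as an optimization driven by multi-view normal constraints. The excerpt does not give the loss itself, so the following is only a minimal sketch of one common way to score such a constraint: the mean of one minus the cosine similarity between predicted and reference normals, taken over every pixel of every view. The function name `normal_consistency_loss` and the list-of-tuples data layout are hypothetical choices for this example, not the paper's formulation.

```python
import math


def _unit(v):
    """Scale a 3-vector to unit length; zero vectors are returned as zeros."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / length for c in v)


def normal_consistency_loss(pred_views, target_views):
    """Mean (1 - cosine similarity) across all pixels of all views.

    Assumed layout: each argument is a list of views, and each view is a
    list of (nx, ny, nz) normals in matching pixel order.
    """
    total = 0.0
    count = 0
    for pred, target in zip(pred_views, target_views):
        for p, t in zip(pred, target):
            pu, tu = _unit(p), _unit(t)
            total += 1.0 - sum(a * b for a, b in zip(pu, tu))
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    front = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
    side = [(1.0, 0.0, 0.0)]
    # Identical normals in both views give zero loss.
    print(normal_consistency_loss([front, side], [front, side]))
```

The loss is 0 when every normal agrees with its reference and grows toward 2 as normals flip to the opposite direction, so minimizing it across several views pushes the surface toward the reference detail from all sides at once.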