Edify 3D: Scalable High-Quality 3D Asset Generation

NVIDIA: Maciej Bala, Yin Cui, Yifan Ding, Yunhao Ge, Zekun Hao, Jon Hasselgren, Jacob Huffman, Jingyi Jin, J. P. Lewis, Zhaoshuo Li, Chen-Hsuan Lin, Yen-Chen Lin, Tsung-Yi Lin, Ming-Yu Liu, Alice Luo, Qianli Ma, Jacob Munkberg, Stella Shi, Fangyin Wei, Donglai Xiang, Jiashu Xu, Xiaohui Zeng, Qinsheng Zhang

TL;DR

This work introduces Edify 3D, a solution that generates high-quality 3D assets with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2 minutes of runtime.

Abstract

We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our method first synthesizes RGB and surface normal images of the described object at multiple viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2 minutes of runtime.

Paper Structure

This paper contains 14 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Edify 3D is a model designed for high-quality 3D asset generation. With input text prompts and/or a reference image, our model can generate a wide range of detailed 3D assets, supporting applications such as video game design, extended reality, simulation, and more.
  • Figure 2: Pipeline of Edify 3D. Given a text description, a multi-view diffusion model synthesizes the RGB appearance of the described object. The generated multi-view RGB images are then used as a condition to synthesize surface normals using a multi-view ControlNet (Zhang et al., 2023). Next, a reconstruction model takes the multi-view RGB and normal images as input and predicts the neural 3D representation using a set of latent tokens. This is followed by isosurface extraction and subsequent mesh post-processing to obtain the mesh geometry. An upscaling ControlNet is used to increase the texture resolution, conditioning on mesh rasterizations to generate high-resolution multi-view RGB images, which are then back-projected onto the texture map.
  • Figure 3: Cross-view attention. In standard diffusion models, each view is synthesized by the diffusion model independently. We extend the self-attention layer (yellow boxes) in our multi-view diffusion models to attend across other viewpoints using the same weights.
  • Figure 4: Comparison of number of sampled views. All images are sampled from the same model. Our multi-view diffusion model can synthesize object images with dense viewpoint coverage while maintaining good multi-view consistency, making it suitable for the downstream reconstruction model.
  • Figure 5: Comparison of number of training views. We compare two models trained primarily with different numbers of views (4 vs. 8), and sample images at the same 10 views at inference time. The model trained primarily on 8 views generates images with better multi-view consistency compared to the 4-view counterpart.
  • ...and 6 more figures
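
The cross-view attention idea in the Figure 3 caption can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes single-head attention, plain Python lists in place of tensors, no camera or positional encoding, and the same number of tokens in every view. All function and parameter names here (`attention`, `per_view_attention`, `cross_view_attention`, `wq`, `wk`, `wv`) are hypothetical. The point it shows is that the same projection weights serve both cases; only the set of tokens each query attends over changes.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a flat token list."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([
            sum(e / z * vj[d] for e, vj in zip(exps, v))
            for d in range(len(v[0]))
        ])
    return out


def per_view_attention(views, wq, wk, wv):
    """Baseline: each view attends only over its own tokens."""
    return [attention(tokens, wq, wk, wv) for tokens in views]


def cross_view_attention(views, wq, wk, wv):
    """Same weights, but every token attends over the tokens of all views.

    Assumes every view holds the same number of tokens.
    """
    flat = [t for tokens in views for t in tokens]
    out = attention(flat, wq, wk, wv)
    n = len(views[0])
    return [out[i * n:(i + 1) * n] for i in range(len(views))]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    views = [[[1.0, 0.0]], [[0.0, 1.0]]]  # 2 views, 1 token each, dim 2
    print(per_view_attention(views, identity, identity, identity))
    print(cross_view_attention(views, identity, identity, identity))
```

With identity weights and one one-hot token per view, the per-view variant returns each token unchanged, since a token can only attend to itself. The cross-view variant mixes in the other view's token, which is the information sharing the caption describes.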