
Envision3D: One Image to 3D with Anchor Views Interpolation

Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E. H. Tay, Li Yuan

TL;DR

Envision3D tackles the challenge of deriving high-quality 3D content from a single image by generating a dense set of multi-view images with strong 3D priors. It introduces a cascade diffusion framework that first creates globally consistent anchor views conditioned on image-normal pairs, then interpolates a dense set of views using a fine-tuned video diffusion model, followed by a coarse-to-fine textured mesh extraction. The approach yields 32 consistent views and superior texture and geometry quality on the GSO dataset and other images, outperforming baselines such as Zero123, SyncDreamer, and Wonder3D. This work enables efficient, high-fidelity 3D content creation from a single image with potential impact on VR, gaming, and robotics pipelines.

Abstract

We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. Recent methods that extract 3D content from multi-view images generated by diffusion models show great potential. However, it is still challenging for diffusion models to generate dense, multi-view consistent images, which is crucial for the quality of 3D content extraction. To address this issue, we propose a novel cascade diffusion framework that decomposes the challenging dense views generation task into two tractable stages, namely anchor views generation and anchor views interpolation. In the first stage, we train an image diffusion model to generate globally consistent anchor views conditioned on image-normal pairs. Subsequently, leveraging our video diffusion model fine-tuned on consecutive multi-view images, we interpolate between the anchor views to generate additional dense views. This framework yields dense, multi-view consistent images, providing comprehensive 3D information. To further enhance the overall generation quality, we introduce a coarse-to-fine sampling strategy for the reconstruction algorithm to robustly extract textured meshes from the generated dense images. Extensive experiments demonstrate that our method is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.
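
The two-stage decomposition described above can be pictured as a view schedule: a few anchor views are placed around the object first, and the remaining views are filled in between neighbouring anchors. The sketch below is only an illustration of that idea, not the authors' implementation. The total of 32 dense views comes from the paper; the number of anchors (8 here), the even azimuth spacing, and the function names `dense_view_schedule` and `interpolation_segments` are assumptions made for the example.

```python
from typing import List, Tuple


def _stride(num_anchors: int, num_dense: int) -> int:
    """Number of dense-view slots from one anchor to the next."""
    if num_anchors <= 0 or num_dense % num_anchors != 0:
        raise ValueError("num_dense must be a positive multiple of num_anchors")
    return num_dense // num_anchors


def dense_view_schedule(num_anchors: int, num_dense: int = 32) -> List[Tuple[float, bool]]:
    """Return (azimuth_in_degrees, is_anchor) for every dense view.

    Assumption: views are evenly spaced on a full 360-degree orbit and
    anchors sit at every `stride`-th position.
    """
    stride = _stride(num_anchors, num_dense)
    step = 360.0 / num_dense
    return [(i * step, i % stride == 0) for i in range(num_dense)]


def interpolation_segments(num_anchors: int, num_dense: int = 32) -> List[Tuple[int, int, List[int]]]:
    """For each pair of neighbouring anchors, list the view indices between them.

    Each entry is (start_anchor_index, end_anchor_index, in_between_indices).
    The last segment wraps around to the first anchor.
    """
    stride = _stride(num_anchors, num_dense)
    segments = []
    for a in range(num_anchors):
        start = a * stride
        end = (start + stride) % num_dense
        between = [(start + k) % num_dense for k in range(1, stride)]
        segments.append((start, end, between))
    return segments


if __name__ == "__main__":
    # Hypothetical setting: 8 anchors, 32 dense views in total.
    for start, end, between in interpolation_segments(8):
        print(f"anchors {start:2d} -> {end:2d}: interpolate views {between}")
```

In this picture, Stage I is responsible for the views flagged as anchors, and Stage II is run once per segment to produce the in-between views.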

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Envision3D generates 32 dense view images and extracts high-quality 3D content from one input image in 3-4 minutes.
  • Figure 2: Overview of Envision3D. Given an input image, Stage I first generates anchor view images with aligned normal maps. Then Stage II interpolates between previous anchor views to generate dense interpolation views. Finally, 3D content is generated through the textured mesh extraction.
  • Figure 3: a) Stage I diffusion model. We implement multi-view attention and cross-domain attention to enforce multi-view consistency and domain alignment. We propose an Instruction Representation Injection (IRI) module to inject image-normal pairs into the diffusion model. b) Stage II diffusion model. We fine-tune the video diffusion model composed of spatial-temporal blocks, ensuring consistency among local dense views. The conditional anchor view latents are reorganized and concatenated with noisy latents, which are taken as model input.
  • Figure 4: The qualitative results generated by Envision3D. We collect various images and test the performance of our method. The leftmost and rightmost columns show the input images and the generated 3D content, respectively.
  • Figure 5: The qualitative comparisons with baseline methods. We compare re-rendered views from the generated 3D content on 4 samples from the GSO dataset and 2 collected images. The rightmost column shows our generated 3D textured meshes.
  • ...and 3 more figures
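
The Figure 3(b) caption says that anchor view latents are reorganized and concatenated with the noisy latents to form the Stage II model input. One plausible reading is sketched below with NumPy; it is not taken from the paper's code. The assumptions are: latents are laid out as (frames, channels, height, width), each anchor latent is placed at its frame position with zeros elsewhere, and the concatenation is along the channel axis. The function name `build_stage2_input` and all shapes are hypothetical.

```python
import numpy as np


def build_stage2_input(noisy_latents, anchor_latents, anchor_frames):
    """Concatenate anchor-view conditioning with noisy latents (illustrative only).

    noisy_latents:  array of shape (T, C, H, W), one noisy latent per frame.
    anchor_latents: array of shape (K, C, H, W), clean latents of K anchor views.
    anchor_frames:  K frame indices saying where each anchor sits in the clip.

    Returns an array of shape (T, 2 * C, H, W).
    """
    T, C, H, W = noisy_latents.shape
    if anchor_latents.shape != (len(anchor_frames), C, H, W):
        raise ValueError("anchor_latents must have shape (K, C, H, W)")

    # Assumption: "reorganized" means scattering each anchor latent to its
    # frame slot, leaving non-anchor frames as zeros.
    cond = np.zeros_like(noisy_latents)
    for latent, frame in zip(anchor_latents, anchor_frames):
        cond[frame] = latent

    # Assumption: concatenation happens along the channel axis.
    return np.concatenate([noisy_latents, cond], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((5, 4, 8, 8))   # 5 frames, 4 latent channels
    anchors = np.ones((2, 4, 8, 8))             # anchors at first and last frame
    model_input = build_stage2_input(noisy, anchors, [0, 4])
    print(model_input.shape)
```

Under this reading, the denoiser sees the clean anchor latents at the segment ends on every step, which is one way the interpolated views could be tied to the anchors.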