Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Lu Sheng

TL;DR

Ouroboros3D tackles the data bias and cross-view inconsistency inherent in two-stage image-to-3D pipelines by unifying multi-view diffusion and 3D reconstruction into a recursive diffusion framework. It introduces a 3D-aware feedback loop and a self-conditioning strategy to jointly train a diffusion-based multi-view generator (SVD) and a feed-forward 3D reconstructor (LGM), achieving improved geometric consistency and high-quality 3D outputs from a single image. The approach demonstrates superior performance over stage-separated pipelines and inference-time fusion methods on both multi-view and 3D reconstruction tasks, with notable gains in PSNR, SSIM, and LPIPS on standard benchmarks. This framework is extensible to different 3D representations and holds practical potential for rapid, single-image-to-3D content creation with reduced data bias.
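
For readers unfamiliar with the metrics named above, the following is a minimal sketch of PSNR only, assuming images are given as flat lists of pixel intensities in [0, 1]. The function name `psnr` and its parameters are illustrative and are not taken from the paper's evaluation code. Higher PSNR and SSIM indicate a closer match to the reference, while lower LPIPS is better; SSIM and LPIPS are not shown here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Assumption: both inputs are flat sequences of intensities in [0, max_value].
    """
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Calling `psnr(gt_pixels, rendered_pixels)` on a ground-truth view and a rendered view (both hypothetical variable names) returns a value in decibels.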

Abstract

Existing single image-to-3D creation methods typically involve a two-stage process: first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, which degrades the quality of the reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion framework with 3D-aware feedback unites the entire process and improves geometric consistency. Experiments show that our framework outperforms both training the two stages separately and existing methods that combine them only at the inference phase. Project page: https://costwen.github.io/Ouroboros3D/

Paper Structure (17 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Ouroboros3D generates multi-view consistent images and high-quality 3D models from single images using 3D-aware recursive diffusion, introducing a novel 3D-aware feedback mechanism that involves iterative cycles of multi-view denoising and reconstruction.
  • Figure 2: Concept comparison between Ouroboros3D and previous two-stage methods. Instead of separating the multi-view diffusion model and the reconstruction model, our framework trains and runs inference with the two models jointly, so that together they form a recursive diffusion process.
  • Figure 3: Overview of 3D-aware recursive diffusion. During multi-view denoising, the diffusion model uses 3D-aware maps rendered by the reconstruction module at the previous step as conditions.
  • Figure 4: Overview of Ouroboros3D. We adopt a video diffusion model as the multi-view generator by incorporating the input image and relative camera poses. In the denoising sampling loop, we decode the predicted $\mathbf{\widetilde{x}}_{0}^{f}$ to noise-corrupted images, which are then used to recover a 3D representation by a feed-forward reconstruction model. The rendered color images and coordinate maps are then encoded and fed into the next denoising step. At inference, the 3D-aware denoising sampling strategy iteratively refines the images by incorporating feedback from the reconstructed 3D into the denoising loop, enhancing multi-view consistency and image quality.
  • Figure 5: Qualitative comparisons of generated multi-view images. Our method achieves better consistency and quality.
  • ...and 3 more figures
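
The sampling loop described in the captions of Figures 3 and 4 can be sketched as a toy program. This is an illustration under strong assumptions, not the authors' implementation: each "view" is reduced to a single number, and `predict_clean_views`, `reconstruct_3d`, `render_maps`, `sample`, and `view_spread` are hypothetical stand-ins for the multi-view diffusion model, the feed-forward reconstructor, the renderer, the sampler, and a consistency measure. The latent encode and decode steps mentioned in the Figure 4 caption are omitted. Only the data flow is meant to match: predict clean views, reconstruct, render, and feed the rendered maps into the next denoising step.

```python
import random

NUM_VIEWS = 4    # toy number of camera views (assumption)
NUM_STEPS = 10   # toy number of denoising steps (assumption)


def predict_clean_views(noisy_views, feedback):
    """Hypothetical stand-in for the multi-view diffusion model.

    Without feedback each view is predicted on its own; with feedback the
    prediction is pulled toward the maps rendered from the shared 3D estimate.
    """
    if feedback is None:
        return list(noisy_views)
    return [0.5 * v + 0.5 * f for v, f in zip(noisy_views, feedback)]


def reconstruct_3d(views):
    """Hypothetical stand-in for the reconstructor: fuse all views into one value."""
    return sum(views) / len(views)


def render_maps(representation, num_views):
    """Hypothetical stand-in for rendering 3D-aware maps at every camera."""
    return [representation] * num_views


def sample(seed, use_feedback=True):
    """Run the toy recursive sampling loop and return (views, 3D estimate)."""
    rng = random.Random(seed)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(NUM_VIEWS)]  # start from noise
    feedback = None        # nothing has been reconstructed before the first step
    representation = None
    for step in range(NUM_STEPS):
        # 1. Predict clean views, conditioned on maps from the previous step.
        x0_pred = predict_clean_views(x_t, feedback)
        # 2. Reconstruct a shared 3D estimate from the predicted views.
        representation = reconstruct_3d(x0_pred)
        # 3. Render the estimate back to every view as the next condition.
        if use_feedback:
            feedback = render_maps(representation, NUM_VIEWS)
        # 4. Move the current sample toward the prediction (toy update rule).
        remaining = NUM_STEPS - step
        x_t = [x + (x0 - x) / remaining for x, x0 in zip(x_t, x0_pred)]
    return x_t, representation


def view_spread(views):
    """Toy inconsistency measure: the range of values across views."""
    return max(views) - min(views)
```

Comparing `view_spread(sample(0, use_feedback=True)[0])` with `view_spread(sample(0, use_feedback=False)[0])` shows the intended effect in this toy setting: with feedback, the views are drawn toward a shared estimate and end up closer to one another than without it.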