Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Wangbo Yu, Chaoran Feng, Yatian Pang, Bin Lin, Li Yuan

TL;DR

Cycle3D addresses the quality and consistency bottlenecks in image-to-3D generation by introducing a generation-reconstruction cycle that jointly leverages a pre-trained 2D diffusion model and a 3D reconstruction module. The method uses Gaussian Splatting as a differentiable 3D representation and enables cross-model feature interaction and reference-view injection to produce high-fidelity, multi-view-consistent 3D content from a single image. Extensive experiments show Cycle3D achieving state-of-the-art results across standard metrics and demonstrate notable improvements in texture, geometry, and diversity, including text-conditioned unseen-view generation. The approach is object-centric, scalable to large datasets, and supports text-to-3D extensions with enhanced robustness via diffusion-reconstruction coupling.

Abstract

Recent 3D large reconstruction models typically employ a two-stage process: they first generate multi-view images with a multi-view diffusion model, and then use a feed-forward model to reconstruct 3D content from those images. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically utilizes a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, the 2D diffusion model is applied to generate high-quality texture, and the reconstruction model guarantees multi-view consistency. Moreover, the 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high quality and consistency compared with state-of-the-art baselines.

Paper Structure (18 sections, 3 equations, 8 figures, 3 tables, 2 algorithms)

Figures (8)

  • Figure 1: Motivation of our pipeline. Current large-scale reconstruction models often produce geometric artifacts and blurry textures due to the limited quality and consistency of the multi-view images generated by multi-view diffusion models. Our Cycle3D cyclically uses a 2D diffusion-based generation model and a reconstruction model during the multi-step diffusion process. During denoising, the 2D generation model improves image quality, while the reconstruction model enhances 3D consistency.
  • Figure 2: Overview of our Cycle3D. The left side illustrates the Cycle3D workflow, while the right side visualizes the denoising process. During the multi-step denoising process, the input view remains clean while the pre-trained 2D generation model gradually produces higher-quality multi-view images and the reconstruction model continuously corrects their 3D inconsistencies. The red boxes highlight inconsistencies between the multi-view images, which are then corrected by the reconstruction model.
  • Figure 3: Process of our Cycle3D. We propose a unified image-to-3D diffusion framework that cyclically utilizes a pre-trained 2D diffusion model and a 3D reconstruction model. During denoising, the 2D diffusion model can inject reference-view features, and the reconstruction model incorporates time embeddings to adapt to $\mathbf{\hat{x}}_0$ at different timesteps. Additionally, the interaction between the features of the reconstruction model's encoder and the 2D diffusion model's decoder enhances the robustness of reconstruction. During inference, we resample $\mathbf{x}_{t-1}$ from the multi-view images $\mathbf{\hat{x}}'_0$ rendered by the reconstruction model and the previous step $\mathbf{x}_t$, while keeping the reference view clean.
  • Figure 4: Qualitative comparisons on image-to-3D generation. Zoom in for more details.
  • Figure 5: Qualitative ablation study, removing either reference-view injection or the feature interaction between the 2D diffusion model and the reconstruction model. Multi-view prior refers to the multi-view images generated by the multi-view diffusion model, which are used as priors for the 2D diffusion model through DDIM inversion. The red boxes highlight some abnormal textures. Reference-view injection can reduce textures in the multi-view prior that are inconsistent with the input, while the absence of feature interaction significantly degrades the reconstruction quality.
  • ...and 3 more figures
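
The inference loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a deterministic DDIM-style update (no added noise) for the resampling step, and the names `ddim_resample`, `cycle_sample`, `predict_x0`, `reconstruct_and_render`, and the `abars` schedule of cumulative noise coefficients are all hypothetical. The two callables stand in for the pre-trained 2D diffusion model and the time-conditioned reconstruction model; reference-view feature injection and encoder-decoder feature interaction are not modelled here.

```python
import numpy as np

def ddim_resample(x_t, x0_rendered, abar_t, abar_prev):
    """Assumed deterministic DDIM-style step from x_t to x_{t-1}.

    x0_rendered plays the role of the rendered, 3D-consistent views x̂'_0.
    abar_t and abar_prev are cumulative alpha products, with abar_t < 1.
    """
    # Noise implied by the current noisy views and the rendered clean views.
    eps = (x_t - np.sqrt(abar_t) * x0_rendered) / np.sqrt(1.0 - abar_t)
    # Re-noise the rendered views to the next (less noisy) level.
    return np.sqrt(abar_prev) * x0_rendered + np.sqrt(1.0 - abar_prev) * eps

def cycle_sample(x_T, ref_view, abars, predict_x0, reconstruct_and_render,
                 ref_index=0):
    """Alternate 2D generation and 3D reconstruction over the schedule.

    x_T:      noisy multi-view stack, shape (views, H, W, C)
    ref_view: clean input view, shape (H, W, C)
    abars:    hypothetical schedule ordered from noisiest to cleanest
    """
    x_t = x_T.copy()
    x_t[ref_index] = ref_view  # the reference view is never noised
    for abar_t, abar_prev in zip(abars[:-1], abars[1:]):
        # 2D diffusion model: per-view estimate of the clean images.
        x0_hat = predict_x0(x_t, abar_t)
        # Reconstruction model: build 3D, render views that are consistent.
        x0_rendered = reconstruct_and_render(x0_hat, abar_t)
        x_t = ddim_resample(x_t, x0_rendered, abar_t, abar_prev)
        x_t[ref_index] = ref_view  # keep the reference view clean
    return x_t
```

In this sketch the reconstruction output replaces the diffusion model's own clean-image estimate before re-noising, which is how the 3D consistency correction would propagate into the next denoising step under the stated assumptions.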