Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle
Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Wangbo Yu, Chaoran Feng, Yatian Pang, Bin Lin, Li Yuan
TL;DR
Cycle3D addresses the quality and consistency bottlenecks in image-to-3D generation by introducing a generation-reconstruction cycle that jointly leverages a pre-trained 2D diffusion model and a 3D reconstruction module. The method uses Gaussian Splatting as a differentiable 3D representation and enables cross-model feature interaction and reference-view injection to produce high-fidelity, multi-view-consistent 3D content from a single image. Extensive experiments show that Cycle3D achieves state-of-the-art results across standard metrics, with notable improvements in texture, geometry, and diversity, including text-conditioned unseen-view generation. The approach is object-centric, scalable to large datasets, and supports text-to-3D extensions, with robustness enhanced by the diffusion-reconstruction coupling.
Abstract
Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then using a feed-forward model to reconstruct 3D content from those images. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically utilizes a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, the 2D diffusion model is applied to generate high-quality texture, and the reconstruction model guarantees multi-view consistency. Moreover, the 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high quality and consistency compared with state-of-the-art baselines.
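The cyclic use of the two modules described above can be illustrated with a toy sketch. This is not the authors' implementation: each "view" is reduced to a single float, and `denoise_predict_x0`, `reconstruct_and_render`, `add_noise`, and `cycle_sample` are hypothetical stand-ins for the 2D diffusion model, the feed-forward reconstruction plus rendering, the noise schedule, and the sampling loop. The only point it tries to show is the control flow: at every denoising step the per-view predictions pass through a shared 3D estimate before being re-noised, and the reference view is re-injected as a clean input.

```python
import random


def denoise_predict_x0(noisy_views, rng):
    """Toy stand-in for the 2D diffusion model (an assumption, not the paper's network).

    Returns a per-view clean estimate with independent, view-specific errors,
    mimicking the inconsistency of purely 2D predictions.
    """
    return [0.5 * v + rng.uniform(-0.1, 0.1) for v in noisy_views]


def reconstruct_and_render(pred_views):
    """Toy stand-in for feed-forward reconstruction followed by rendering.

    A single shared 'scene' value is estimated from all views and rendered
    back to every view, so the outputs agree with each other by construction.
    """
    scene = sum(pred_views) / len(pred_views)
    return [scene for _ in pred_views]


def add_noise(clean_views, t, num_steps, rng):
    """Re-noise the consistent views to noise level t (zero noise at t == 0)."""
    sigma = t / num_steps
    return [v + sigma * rng.gauss(0.0, 1.0) for v in clean_views]


def cycle_sample(reference, num_views=4, num_steps=10, seed=0):
    """Run the toy generation-reconstruction cycle and return the final views."""
    rng = random.Random(seed)
    views = [rng.gauss(0.0, 1.0) for _ in range(num_views)]
    rendered = list(views)
    for t in range(num_steps, 0, -1):
        views[0] = reference                    # reference-view injection (toy version)
        x0 = denoise_predict_x0(views, rng)     # generation: per-view clean estimates
        x0[0] = reference                       # keep the known view exact
        rendered = reconstruct_and_render(x0)   # reconstruction: enforce consistency
        views = add_noise(rendered, t - 1, num_steps, rng)  # continue denoising
    return rendered


if __name__ == "__main__":
    print(cycle_sample(reference=1.0))
```

In this sketch the consistency comes entirely from the shared `scene` value; in a real system that role would be played by the reconstructed 3D representation and its renderer, and the denoiser would be a pre-trained image diffusion model rather than a scalar function.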
