Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation
Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu
TL;DR
Zero-1-to-G presents a one-stage direct 3D generation framework that harnesses pretrained 2D diffusion priors by decomposing Gaussian splats into multi-view attribute images and employing cross-view and cross-attribute attention. A VAE decoder is finetuned to align splatter renderings with diffusion priors, enabling efficient training on large 2D priors while preserving 3D structure. The method achieves superior 3D geometry and rendering quality on unseen objects and in-the-wild data, with notable gains in view-consistency and generalization, and demonstrates favorable training efficiency compared to prior 3D diffusion methods. This approach offers a scalable path for high-fidelity 3D content generation by effectively bridging 2D diffusion priors with 3D representations.
Abstract
Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
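
The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the per-pixel channel layout (`ATTRIBUTE_CHANNELS`), the function names `decompose_splatter` and `group_tokens`, and the token-grouping scheme are all assumptions made for illustration. The sketch assumes each pixel of each view stores one Gaussian, splits that per-pixel vector into separate attribute images, and shows one plausible way tokens could be regrouped so that attention runs either across views or across attributes.

```python
import numpy as np

# Hypothetical channel layout for one Gaussian per pixel (an assumption,
# not taken from the paper): 3 + 3 + 3 + 4 + 1 = 14 channels.
ATTRIBUTE_CHANNELS = {
    "rgb": 3,
    "position": 3,
    "scale": 3,
    "rotation": 4,
    "opacity": 1,
}


def decompose_splatter(splatter):
    """Split a (V, H, W, 14) splatter array into per-attribute image stacks.

    Returns a dict mapping attribute name to an array of shape (V, H, W, C_attr).
    """
    expected = sum(ATTRIBUTE_CHANNELS.values())
    if splatter.ndim != 4 or splatter.shape[-1] != expected:
        raise ValueError(f"expected shape (V, H, W, {expected}), got {splatter.shape}")
    images = {}
    start = 0
    for name, width in ATTRIBUTE_CHANNELS.items():
        images[name] = splatter[..., start:start + width]
        start += width
    return images


def group_tokens(features, mode):
    """Regroup (V, A, N, C) tokens into attention sequences.

    V = views, A = attributes, N = tokens per image, C = feature width.
    "cross_view":      one sequence per attribute, spanning all views -> (A, V*N, C)
    "cross_attribute": one sequence per view, spanning all attributes -> (V, A*N, C)
    """
    v, a, n, c = features.shape
    if mode == "cross_view":
        return features.transpose(1, 0, 2, 3).reshape(a, v * n, c)
    if mode == "cross_attribute":
        return features.reshape(v, a * n, c)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    splatter = np.random.rand(4, 32, 32, 14)  # 4 views, 32x32 Gaussians per view
    images = decompose_splatter(splatter)
    print({name: img.shape for name, img in images.items()})

    tokens = np.random.rand(4, len(ATTRIBUTE_CHANNELS), 16, 8)
    print(group_tokens(tokens, "cross_view").shape)
    print(group_tokens(tokens, "cross_attribute").shape)
```

Under these assumptions, any standard self-attention layer applied to the regrouped sequences would mix information across views (for a fixed attribute) or across attributes (for a fixed view), which is the role the abstract assigns to the cross-view and cross-attribute attention layers.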
