Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation

Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu

TL;DR

Zero-1-to-G presents a one-stage direct 3D generation framework that harnesses pretrained 2D diffusion priors by decomposing Gaussian splats into multi-view attribute images and employing cross-view and cross-attribute attention. A VAE decoder is finetuned to align splatter renderings with diffusion priors, enabling efficient training on large 2D priors while preserving 3D structure. The method achieves superior 3D geometry and rendering quality on unseen objects and in-the-wild data, with notable gains in view-consistency and generalization, and demonstrates favorable training efficiency compared to prior 3D diffusion methods. This approach offers a scalable path for high-fidelity 3D content generation by effectively bridging 2D diffusion priors with 3D representations.

Abstract

Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
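
To make the key insight concrete, the following is a minimal sketch of how a stack of splatter images could be split into per-attribute multi-view images and fused back into a flat set of Gaussians. The 14-channel layout, the attribute names, and the helper functions `decompose_splatter` and `fuse_to_gaussians` are illustrative assumptions, not the authors' implementation; only the use of 6 views follows the paper's main experiments.

```python
import numpy as np

# Hypothetical per-pixel channel layout of a splatter image.
# This ordering is an assumption made for illustration only.
ATTRIBUTE_SLICES = {
    "position": slice(0, 3),
    "opacity": slice(3, 4),
    "scale": slice(4, 7),
    "rotation": slice(7, 11),
    "rgb": slice(11, 14),
}
NUM_CHANNELS = 14


def decompose_splatter(splatter):
    """Split a (V, H, W, 14) splatter stack into per-attribute multi-view images."""
    if splatter.ndim != 4 or splatter.shape[-1] != NUM_CHANNELS:
        raise ValueError("expected an array of shape (V, H, W, 14)")
    return {name: splatter[..., sl] for name, sl in ATTRIBUTE_SLICES.items()}


def fuse_to_gaussians(attribute_images):
    """Concatenate attribute images in layout order; each pixel becomes one Gaussian."""
    stacked = np.concatenate(
        [attribute_images[name] for name in ATTRIBUTE_SLICES], axis=-1
    )
    return stacked.reshape(-1, stacked.shape[-1])


# Toy data: 6 views of 8x8 pixels, filled with random values.
rng = np.random.default_rng(0)
splatter = rng.standard_normal((6, 8, 8, NUM_CHANNELS)).astype(np.float32)

images = decompose_splatter(splatter)
gaussians = fuse_to_gaussians(images)
```

In this sketch every pixel of every view contributes one Gaussian, so 6 views of 8x8 pixels yield 384 Gaussians. Each attribute image has the same spatial layout as an ordinary image, which is what allows a 2D diffusion model to operate on it.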

Paper Structure (16 sections, 7 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Zero-1-to-G tackles direct Gaussian splat generation from single images. By using pretrained 2D diffusion models, we are able to generalize to in-the-wild objects.
  • Figure 2: The pipeline of Zero-1-to-G. During training, we fine-tune both the VAE decoder (Sec. \ref{sec:finetune-vae}) and the denoising UNet (Sec. \ref{sec:mv-and-cd-attention}) of Stable Diffusion. At inference time, given a single-view input of the target object, each component of the splatter image is generated by conditioning on the camera view and the attribute switcher. The generated set of splatter image components can be directly fused into Gaussian splats (Sec. \ref{sec:data-decomposition}). Here we show splatter images of 3 views for better illustration, while our main experiments are conducted with 6 views.
  • Figure 3: Qualitative comparison between rendering results from splatters acquired through per-object fitting-based methods and splatters predicted by our fine-tuned LGM in a feed-forward manner.
  • Figure 4: VAE encoding and decoding comparison with per-scene optimized splatters and feed-forward predicted splatters.
  • Figure 5: RGB and normal renderings of more examples on the MVImgNet dataset.
  • ...and 5 more figures
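
The cross-view and cross-attribute attention named in the abstract can be pictured as two different ways of grouping the same latent tokens before self-attention. The sketch below is one plausible reading, not the paper's architecture: the latent shape `(B, A, V, H, W, C)`, the helper names, and the projection-free attention are all assumptions chosen to keep the example short.

```python
import numpy as np


def group_for_cross_view(latents):
    """(B, A, V, H, W, C) -> (B*A, V*H*W, C): tokens of one attribute attend across views."""
    b, a, v, h, w, c = latents.shape
    return latents.reshape(b * a, v * h * w, c)


def group_for_cross_attribute(latents):
    """(B, A, V, H, W, C) -> (B*V, A*H*W, C): tokens of one view attend across attributes."""
    b, a, v, h, w, c = latents.shape
    swapped = latents.transpose(0, 2, 1, 3, 4, 5)  # (B, V, A, H, W, C)
    return swapped.reshape(b * v, a * h * w, c)


def self_attention(tokens):
    """Plain scaled dot-product self-attention without learned projections (simplification)."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


# Toy latents: batch 2, 5 attributes, 6 views, 4x4 spatial grid, 8 channels.
rng = np.random.default_rng(0)
latents = rng.standard_normal((2, 5, 6, 4, 4, 8))

view_tokens = group_for_cross_view(latents)
attr_tokens = group_for_cross_attribute(latents)
view_out = self_attention(view_tokens)
attr_out = self_attention(attr_tokens)
```

Under these assumptions, the first grouping lets each attribute image exchange information across all views, while the second lets the attributes of a single view stay mutually consistent. A real layer would add learned query, key, and value projections.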