AGG: Amortized Generative 3D Gaussians for Single Image to 3D

Dejia Xu, Ye Yuan, Morteza Mardani, Sifei Liu, Jiaming Song, Zhangyang Wang, Arash Vahdat

TL;DR

This work tackles single-image-to-3D generation with AGG, an amortized framework that directly predicts 3D Gaussian representations without per-object optimization. A coarse hybrid generator predicts Gaussian locations and texture with separate transformers; a Gaussian super-resolution module then densifies the coarse Gaussians in latent space while integrating RGB cues. Training is stabilized by fixed Gaussian counts, canonical initialization, and warmup with pseudo labels, which enables zero-shot object generation with rendering-based supervision. On OmniObject3D, AGG shows competitive qualitative and quantitative performance while running orders of magnitude faster at inference than optimization-based 3D Gaussian methods and diffusion-based baselines, highlighting its practicality for real-time single-image-to-3D content creation.

Abstract

Given the growing need for automatic 3D content creation pipelines, various 3D representations have been studied to generate 3D objects from a single image. Owing to their superior rendering efficiency, 3D Gaussian splatting-based models have recently excelled in both 3D reconstruction and generation. 3D Gaussian splatting approaches for image-to-3D generation are often optimization-based, requiring many computationally expensive score-distillation steps. To overcome these challenges, we introduce an Amortized Generative 3D Gaussian framework (AGG) that instantly produces 3D Gaussians from a single image, eliminating the need for per-instance optimization. Utilizing an intermediate hybrid representation, AGG decomposes the generation of 3D Gaussian locations and other appearance attributes for joint optimization. Moreover, we propose a cascaded pipeline that first generates a coarse representation of the 3D data and later upsamples it with a 3D Gaussian super-resolution module. Our method is evaluated against existing optimization-based 3D Gaussian frameworks and sampling-based pipelines utilizing other 3D representations, where AGG showcases competitive generation abilities both qualitatively and quantitatively while being several orders of magnitude faster. Project page: https://ir1d.github.io/AGG/

Paper Structure (29 sections, 5 equations, 5 figures, 2 tables)

This paper contains 29 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our AGG framework. We design a novel cascaded generation pipeline that produces 3D Gaussian-based objects without per-instance optimization. Our AGG framework involves a coarse generator that predicts a hybrid representation for 3D Gaussians at a low resolution and a super-resolution module that delivers dense 3D Gaussians in the fine stage.
  • Figure 2: Architecture of our coarse hybrid generator. We first use a pre-trained DINOv2 image encoder to extract essential features and then adopt two transformers that individually map learnable query tokens to Gaussian locations and a texture field. The texture field accepts location queries from the geometry branch, and a decoding MLP further converts the interpolated plane features into Gaussian attributes.
  • Figure 3: Illustration of the second-stage Gaussian super-resolution network. We first encode the original input image and the stage-one prediction separately. Then, we fuse them through cross-attention in the latent space. We perform super-resolution in the latent space and decode the features jointly.
  • Figure 4: Novel-view rendering comparisons against baseline methods. Our AGG model saw none of these test images during training.
  • Figure 5: Visual comparisons of our full model (c) and its variants: (a) w/o Texture Field, (b) w/o Super Resolution.
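
The texture-field query described in the Figure 2 caption (location queries from the geometry branch, interpolated plane features, then a decoding MLP) can be illustrated with a minimal pure-Python sketch. Several details are assumptions made only for illustration and are not stated in the text above: the field is taken to be three axis-aligned feature planes whose sampled features are summed, coordinates are taken to lie in [-1, 1], and the decoder is a two-layer ReLU MLP. The function names `bilinear_sample`, `query_texture_field`, and `decode_mlp` are hypothetical, and the layout of the output attribute vector is left unspecified.

```python
import math


def bilinear_sample(plane, u, v):
    """Bilinearly sample an H x W grid of feature vectors at (u, v) in [-1, 1]."""
    h, w = len(plane), len(plane[0])
    # Map [-1, 1] to continuous pixel coordinates and clamp to the grid.
    x = min(max((u + 1.0) * 0.5 * (w - 1), 0.0), w - 1.0)
    y = min(max((v + 1.0) * 0.5 * (h - 1), 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    corners = zip(plane[y0][x0], plane[y0][x1], plane[y1][x0], plane[y1][x1])
    return [
        (1 - tx) * (1 - ty) * a + tx * (1 - ty) * b
        + (1 - tx) * ty * c + tx * ty * d
        for a, b, c, d in corners
    ]


def query_texture_field(planes, point):
    """Project a 3D point onto the xy, xz, and yz planes and sum the features.

    Assumption: a three-plane layout with summed features; the source only
    mentions "interpolated plane features".
    """
    x, y, z = point
    feats = [
        bilinear_sample(planes["xy"], x, y),
        bilinear_sample(planes["xz"], x, z),
        bilinear_sample(planes["yz"], y, z),
    ]
    return [sum(vals) for vals in zip(*feats)]


def decode_mlp(feature, w1, b1, w2, b2):
    """Two-layer ReLU MLP mapping a sampled feature to an attribute vector."""
    hidden = [
        max(0.0, sum(wi * f for wi, f in zip(row, feature)) + b)
        for row, b in zip(w1, b1)
    ]
    return [
        sum(wi * hi for wi, hi in zip(row, hidden)) + b
        for row, b in zip(w2, b2)
    ]


# Toy usage: one-channel 2 x 2 planes and a single predicted Gaussian location.
toy_plane = [[[0.0], [1.0]], [[2.0], [3.0]]]
toy_planes = {"xy": toy_plane, "xz": toy_plane, "yz": toy_plane}
toy_feature = query_texture_field(toy_planes, (0.0, 0.0, 0.0))
toy_attributes = decode_mlp(toy_feature, [[1.0], [-1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.5])
```

In this sketch the geometry branch only has to supply 3D points. The appearance attributes are looked up at those points, which is one way to read the caption's statement that locations and texture are predicted by separate branches.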