GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner

Abstract

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embeddings, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step by step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This design leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
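
As a concrete illustration of the next-token formulation, the sketch below shows temperature-controlled autoregressive sampling with a causal transformer. This is a minimal sketch under stated assumptions, not the released implementation: the `model`, `bos_id`, and `eos_id` names, a shared vocabulary covering both position and feature codes, and the end-of-scene stopping criterion are all hypothetical.

```python
# Minimal sketch (assumptions, not the authors' code): `model` is a causal
# transformer returning logits of shape (batch, time, vocab) over a shared
# vocabulary of position and feature codes; `bos_id`/`eos_id` are hypothetical
# begin/end-of-scene tokens.
import torch


@torch.no_grad()
def sample_scene_tokens(model, bos_id, eos_id, max_len=2048, temperature=0.8):
    """Draw interleaved tokens p_1, f_1, p_2, f_2, ... one at a time."""
    seq = torch.tensor([[bos_id]])             # (batch=1, time=1) prefix
    for _ in range(max_len):
        logits = model(seq)[:, -1, :]          # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)      # temperature controls diversity
        seq = torch.cat([seq, nxt], dim=1)
        if nxt.item() == eos_id:               # flexible generation horizon:
            break                              # the model decides when to stop
    return seq[0, 1:]                          # strip the BOS token
```

Because generation is strictly sequential, completion and outpainting reduce to seeding the prefix `seq` with tokens encoded from an existing partial scene and sampling the remainder.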

Figures (11)

  • Figure 1: GaussianGPT is a purely autoregressive approach for 3D Gaussian scene generation. Our approach enables unconditional scene generation, scene completion, and large-scale scene synthesis using only a single model.
  • Figure 2: Overview of GaussianGPT. A 3D Gaussian scene is subsampled into a sparse voxel grid and encoded into per-voxel features. A sparse 3D CNN compresses the grid into discrete codebook indices. The resulting latent grid is serialized via $xyz$ ordering and represented as interleaved position tokens $p_i$ and feature tokens $f_i$. A causal transformer with 3D RoPE predicts alternating position and feature tokens, which are mapped back to voxel locations and decoded through LFQ to reconstruct the 3D Gaussian scene. (A sketch of this serialization step follows the figure list.)
  • Figure 3: Qualitative results on unconditional chair generation. From left to right: DiffRF [muller_diffrf_2023], L3DG [roessle_l3dg_2024], and GaussianGPT (ours). All models generate high-fidelity shapes; our method produces clean Gaussian allocations with consistent geometric structure.
  • Figure 4: Qualitative results on unconditional scene generation. We compare L3DG [roessle_l3dg_2024] and GaussianGPT (ours). L3DG generates full but normalized scenes; ours generates full-scale scene chunks.
  • Figure 5: $12\,\mathrm{m} \times 12\,\mathrm{m}$ scene synthesis via autoregressive outpainting. GaussianGPT sequentially generates and appends latent grid columns, enabling scenes to grow beyond the training horizon.
  • ...and 6 more figures
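
The serialization step sketched in Figure 2 can be made concrete. The code below is an illustrative assumption, not the released pipeline: it takes occupied latent-grid voxels as an `(N, 3)` integer array `coords` with per-voxel codebook indices `codes`, sorts them in x-major $xyz$ order, flattens each coordinate into a single position token, and interleaves position and feature tokens as $p_1, f_1, p_2, f_2, \dots$; whether positions are flattened or kept as coordinate triples is an assumption here.

```python
# Hypothetical serialization of a sparse latent grid (Figure 2), assuming
# `coords` is (N, 3) integer voxel coordinates and `codes` is (N,) codebook
# indices produced by the VQ encoder.
import numpy as np


def serialize_latent_grid(coords, codes, grid_size):
    """Sort voxels x-major (then y, then z) and interleave p_i, f_i tokens."""
    _, gy, gz = grid_size
    order = np.lexsort((coords[:, 2], coords[:, 1], coords[:, 0]))  # x, y, z
    coords, codes = coords[order], codes[order]
    pos = coords[:, 0] * gy * gz + coords[:, 1] * gz + coords[:, 2]  # flatten
    tokens = np.empty(2 * len(codes), dtype=np.int64)
    tokens[0::2], tokens[1::2] = pos, codes    # p_1, f_1, p_2, f_2, ...
    return tokens


def deserialize(tokens, grid_size):
    """Invert: recover (N, 3) voxel coordinates and their codebook indices."""
    _, gy, gz = grid_size
    pos, codes = tokens[0::2], tokens[1::2]
    x, rem = pos // (gy * gz), pos % (gy * gz)
    return np.stack([x, rem // gz, rem % gz], axis=1), codes
```

Since the mapping is exactly invertible, each predicted feature token can be placed back at its voxel before LFQ decoding reconstructs the Gaussians.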