
ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction

Quanyuan Ruan, Kewei Shi, Jiabao Lei, Xifeng Gao, Xiaoguang Han

Abstract

Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework that makes next-scale predictions in parallel to generate 3D content according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only \(\mathcal{O}(\log n)\) steps, where \(n\) is the number of points. Furthermore, we propose a tree-based transformer that predicts the tree structure auto-regressively, allowing leaf nodes to attend to their internal ancestors to enhance structural consistency. Extensive experiments demonstrate that our approach effectively generates multi-scale Gaussian representations with controllable levels of detail, high visual fidelity, and a manageable time budget.
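The logarithmic step count follows from the pairwise structure of the simplification: each round halves the point set, so coarsening \(n\) points to one takes \(\lceil\log_2 n\rceil\) rounds, and reversing the hierarchy for generation takes the same number of expansion steps. The sketch below illustrates this with a toy pairwise-averaging merge; the paper's actual Gaussian simplification criterion is not reproduced here, and the point pairing (adjacent indices) is a hypothetical stand-in.

```python
def simplify_pairwise(points):
    """Toy stand-in for the paper's Gaussian simplification:
    repeatedly halve the point set by merging index-adjacent pairs
    (the real merging criterion is geometric, not index-based).
    Returns the list of levels, finest first, single root last."""
    levels = [points]
    while len(points) > 1:
        merged = [((points[i][0] + points[i + 1][0]) / 2,
                   (points[i][1] + points[i + 1][1]) / 2)
                  for i in range(0, len(points) - 1, 2)]
        if len(points) % 2:      # odd leftover point is carried up unmerged
            merged.append(points[-1])
        points = merged
        levels.append(points)
    return levels

pts = [(float(i), 0.0) for i in range(8)]
levels = simplify_pairwise(pts)
# 8 points collapse in log2(8) = 3 rounds: 8 -> 4 -> 2 -> 1,
# so generation (the reverse walk) also needs only 3 steps.
print([len(level) for level in levels])
```

Reversing `levels` gives the coarse-to-fine schedule the generator follows, with every point of a level predicted in parallel.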

Paper Structure

This paper contains 19 sections, 12 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: We propose a novel framework for generating Gaussian Splatting fields using a next-scale autoregressive prediction paradigm. Unlike diffusion-based approaches (e.g., DDPM (Ho et al., 2020)), which iteratively denoise latent representations, or conventional autoregressive models that synthesize points sequentially, our method predicts the Level-of-Detail (LoD) hierarchy of the Gaussian Splatting field and reconstructs the scene in only $\log n$ steps, where $n$ denotes the number of Gaussians.
  • Figure 2: A sample of our simplification process. From left to right and top to bottom, we visualize how a Gaussian Splatting object is progressively downsampled to a single point.
  • Figure 3: Comparison of autoregressive generation steps. Numbers on the top right indicate the number of Gaussian points. Top: Vanilla autoregressive (AR) models predict one token at a time; therefore, generating $n$ points requires $n-1$ sequential steps. Bottom: Our hierarchical AR model predicts the next level in a spatial hierarchy, where each level expansion generates multiple points in parallel. This hierarchical formulation reduces the generation complexity from linear to logarithmic, requiring only $\log n$ steps.
  • Figure 4: Hierarchical Spatial Structure. Left: The simplification process merges node pairs iteratively to form higher-level nodes; independent pairs can be merged in parallel. Middle: The binary tree view reverses the simplification process, reconstructing the hierarchy from the root and removing duplicate nodes to obtain level-wise data. Right: The hierarchical spatial tree representation stores only the leaf nodes at each level, discarding internal nodes that have already been split.
  • Figure 5: Attention Mask. Left: The causal attention mask. It depends on a sorted sequence and generates tokens sequentially, requiring $n$ steps in total. Middle: The level-wise attention mask. It attends only to the leaf nodes within each level, reducing the generation complexity to $\log n$ steps. Right: The tree-based attention mask. It extends level-wise attention by also considering internal nodes from previous levels, allowing token generation within $\log n$ steps. The rightmost panel illustrates how many tokens are required to decode each level of detail.
  • ...and 6 more figures
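The three masks in Figure 5 can be contrasted on a small complete binary tree. The sketch below uses a heap-style level-order layout (children of node $i$ at $2i+1$ and $2i+2$), which is a hypothetical layout chosen only for illustration; the paper's exact mask construction is not reproduced, but the shapes match the caption's description: causal attention needs $n$ sequential steps, level-wise attention one step per level, and the tree-based variant restricts each token to its internal ancestors.

```python
from math import floor, log2

n = 7  # complete binary tree with 3 levels: 1 + 2 + 4 tokens,
       # heap layout (children of i are 2i+1, 2i+2) -- a hypothetical
       # layout used only to illustrate the three mask shapes.

def level(i):
    """Depth of node i in the level-order (heap) layout."""
    return floor(log2(i + 1))

def ancestors(i):
    """Internal ancestors of node i up to the root."""
    a = set()
    while i > 0:
        i = (i - 1) // 2
        a.add(i)
    return a

# Causal mask: token i attends to every earlier token,
# so generation is fully sequential (n steps).
causal = [[j <= i for j in range(n)] for i in range(n)]

# Level-wise mask: token i attends to all tokens in strictly
# earlier levels, so a whole level is emitted per step (log n steps).
levelwise = [[level(j) < level(i) for j in range(n)] for i in range(n)]

# Tree-based mask: a token attends only to its internal ancestors,
# tying each leaf back to the coarse nodes it was split from.
tree = [[j in ancestors(i) for j in range(n)] for i in range(n)]

print([level(i) for i in range(n)])  # 3 levels -> 3 generation steps
```

With this layout, node 6 descends from nodes 2 and 0, so the tree-based mask lets it attend to exactly those two internal ancestors while ignoring its sibling subtree.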