XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

Xuanchi Ren; Jiahui Huang; Xiaohui Zeng; Ken Museth; Sanja Fidler; Francis Williams

TL;DR

XCube introduces a hierarchical sparse voxel latent diffusion model for high‑resolution 3D generation, capable of producing up to $1024^3$ voxels with rich attributes in a feed‑forward pass. Each level in the sparse voxel hierarchy is encoded by a level‑specific Sparse Structure VAE and generated by a diffusion model conditioned on the coarser level, with cross‑attention and AdaGN enabling text and category guidance. A VDB‑based sparse 3D framework powers fast sampling and scalable memory usage, enabling complex shapes and large outdoor scenes to be generated in under 30 seconds. Empirical results on ShapeNet, Objaverse, Waymo, and Karton City demonstrate state‑of‑the‑art quality and multi‑modal capabilities (including text‑to‑3D and single‑scan completion), and ablations validate the hierarchical, coarse‑to‑fine design and the critical role of progressive pruning.

Abstract

We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
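The scene-scale figures in the abstract can be checked with a little arithmetic. The sketch below uses only the numbers quoted above (100 m extent, 10 cm voxels, $1024^3$ finest resolution); the variable names are illustrative and do not come from the XCube code.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Lengths are kept in centimetres so the division is exact integer math.
scene_extent_cm = 100 * 100   # 100 m scene side length (from the abstract)
voxel_size_cm = 10            # 10 cm finest voxel size (from the abstract)
max_resolution = 1024         # finest effective resolution per axis (from the abstract)

# Number of voxels needed along one side of the scene.
voxels_per_side = scene_extent_cm // voxel_size_cm

# Number of cells a fully dense grid at the finest resolution would hold.
dense_cells = max_resolution ** 3

print(voxels_per_side)                    # 1000
print(voxels_per_side <= max_resolution)  # True
print(dense_cells)                        # 1073741824
```

A 100 m scene at 10 cm voxels therefore needs 1000 voxels per side, which fits inside the $1024^3$ effective resolution. A dense grid at that resolution would hold about a billion cells, whereas the abstract speaks of generating millions of voxels, which is why a sparse representation is the natural fit.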

Paper Structure (22 sections, 10 equations, 24 figures, 5 tables)

Introduction
Related Work
Method
Sparse Structure VAE
Hierarchical Voxel Latent Diffusion
Training and Sampling
Implementation Details
Experiments
Object-level 3D Generation on ShapeNet
Object-level 3D Generation on Objaverse
Large-scale Scene-level 3D Generation
Ablation Study
Discussion
Sparse 3D Learning Framework
Implementation Details
...and 7 more sections

Figures (24)

Figure 1: XCube ($\mathcal{X}^3$). Our model generates high-resolution (up to $1024^3$) sparse 3D voxel hierarchies of objects and driving scenes in under 30 seconds. The voxels are enriched with arbitrary attributes such as semantics, normals, and TSDF, from which a mesh can be readily extracted. Here we show randomly sampled geometries using our model trained on ShapeNet, Objaverse, Karton City, and Waymo.
Figure 2: Method. Sparse voxel grids within the hierarchy are first encoded into compact latent representations using a sparse structure VAE. The hierarchical latent diffusion model then learns to generate each level of the latent representation conditioned on the coarser level in a cascaded fashion. The generated high-resolution voxel grids contain various attributes for different applications. Note that technically ${\mathbf{X}}_1$ is a dense latent grid, but illustrated as a sparse one for clarity.
Figure 3: VAE Decoder Architecture. Coarser levels of grids ${\mathbf{G}}_l$ are upsampled to finer grids ${\mathbf{G}}_{l+1}$ by iteratively subdividing existing voxels into octants and pruning excessive ones. Each level may contain many upsampling layers that double the resolution.
Figure 4: ShapeNet Qualitative Comparison. We compare our method with LION [zeng2022lion], NFD, and NWD. Our method is capable of generating intricate geometry and thin structures. Best viewed with 200% zoom-in.
Figure 5: Close-up View of Our Generated Shape. The voxel grid is colored by predicted normal. XCube is able to generate a high level of detail, such as the car interior and airplane propellers.
...and 19 more figures
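
The Figure 3 caption describes upsampling as repeatedly subdividing voxels into octants and pruning the excess ones. The following is a minimal NumPy sketch of that idea only, not the XCube implementation: the function names are hypothetical, voxels are stored as plain integer coordinates rather than in a VDB structure, and the pruning rule (keep voxels near a sphere surface) is a hand-written stand-in for whatever the decoder actually predicts.

```python
import numpy as np

# The 8 child offsets produced when a voxel is split in two along each axis.
OCTANT_OFFSETS = np.array(
    [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], dtype=np.int64
)


def subdivide(coords: np.ndarray) -> np.ndarray:
    """Split each voxel in an (N, 3) integer array into its 8 octants -> (8N, 3)."""
    children = coords[:, None, :] * 2 + OCTANT_OFFSETS[None, :, :]
    return children.reshape(-1, 3)


def near_sphere_surface(coords: np.ndarray, resolution: int) -> np.ndarray:
    """Hypothetical pruning rule (assumption, not the learned one).

    Keeps voxels whose center lies within one voxel width of a sphere of
    radius 0.4 centered in the unit cube.
    """
    centers = (coords + 0.5) / resolution - 0.5
    radius = np.linalg.norm(centers, axis=1)
    return np.abs(radius - 0.4) <= 1.0 / resolution


def upsample(coords: np.ndarray, resolution: int, levels: int):
    """Apply `levels` rounds of subdivide-then-prune, doubling resolution each time."""
    for _ in range(levels):
        coords = subdivide(coords)
        resolution *= 2
        coords = coords[near_sphere_surface(coords, resolution)]
    return coords, resolution


# Start from a dense 4^3 grid and refine it three times to 32^3.
coarse = np.indices((4, 4, 4)).reshape(3, -1).T
fine, res = upsample(coarse, resolution=4, levels=3)
print(res, fine.shape[0], res ** 3)  # resolution, kept voxels, dense cell count
```

Because pruning runs after every doubling, the number of active voxels tracks the surface rather than the volume, so each finer level only processes children of voxels that survived the coarser one.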
