XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams
TL;DR
XCube introduces a hierarchical sparse voxel latent diffusion model for high-resolution 3D generation, capable of producing sparse grids with an effective resolution of up to $1024^3$ voxels and rich attributes in a feed-forward pass. Each level in the sparse voxel hierarchy is encoded by a level-specific Sparse Structure VAE and generated by a diffusion model conditioned on the coarser level, with cross-attention and AdaGN enabling text and category guidance. A VDB-based sparse 3D framework provides fast sampling and scalable memory usage, so that complex shapes and large outdoor scenes can be generated in under 30 seconds. Empirical results on ShapeNet, Objaverse, Waymo, and Karton City demonstrate state-of-the-art quality and multi-modal capabilities (including text-to-3D and single-scan completion), and ablations support the hierarchical, coarse-to-fine design and the critical role of progressive pruning.
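The coarse-to-fine idea with progressive pruning can be illustrated with a toy sketch. This is not XCube's implementation. The learned VAE and diffusion models are replaced by a hypothetical geometric rule (`prune`, which keeps voxels near a sphere surface), the factor-of-two step between levels is an assumption made for simplicity, and plain Python sets stand in for the VDB structure. It only shows why refining the surviving voxels keeps the active count far below a dense grid.

```python
import math
from itertools import product

# Offsets of the 8 children of a voxel when the resolution doubles.
CHILD_OFFSETS = list(product((0, 1), repeat=3))


def subdivide(voxels):
    """Replace each integer voxel coordinate by its 8 children at 2x resolution."""
    return {
        (2 * x + dx, 2 * y + dy, 2 * z + dz)
        for (x, y, z) in voxels
        for (dx, dy, dz) in CHILD_OFFSETS
    }


def prune(voxels, resolution, radius=0.35):
    """Toy pruning rule (hypothetical stand-in for a learned predictor).

    Keeps a voxel if its center lies within half a voxel diagonal of a
    sphere surface of the given radius, in a unit cube centered at the origin.
    """
    half_diagonal = math.sqrt(3.0) / (2.0 * resolution)
    kept = set()
    for x, y, z in voxels:
        center = [(c + 0.5) / resolution - 0.5 for c in (x, y, z)]
        dist_to_surface = abs(math.sqrt(sum(c * c for c in center)) - radius)
        if dist_to_surface <= half_diagonal:
            kept.add((x, y, z))
    return kept


def coarse_to_fine(start_resolution=8, num_refinements=4):
    """Return [(resolution, active_voxels), ...] from coarse to fine."""
    resolution = start_resolution
    # Only the coarsest level is ever enumerated densely.
    voxels = prune(set(product(range(resolution), repeat=3)), resolution)
    levels = [(resolution, voxels)]
    for _ in range(num_refinements):
        resolution *= 2
        # Finer levels only visit children of voxels that survived pruning.
        voxels = prune(subdivide(voxels), resolution)
        levels.append((resolution, voxels))
    return levels


if __name__ == "__main__":
    for resolution, voxels in coarse_to_fine():
        fraction = len(voxels) / resolution ** 3
        print(f"{resolution:4d}^3: {len(voxels):7d} active voxels "
              f"({fraction:.2%} of the dense grid)")
```

In this toy setting the active fraction shrinks at every level, because a surface occupies roughly $O(N^2)$ of the $N^3$ cells. That is the property a sparse hierarchy exploits.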
Abstract
We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
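As a rough sanity check on the scales quoted above, the short calculation below relates the 100m extent and 10cm voxel size to the $1024^3$ effective resolution. The helper names are illustrative and not part of the XCube code. The comparison with a dense grid is plain arithmetic, not a measurement from the paper.

```python
def cells_per_side(extent_m: float, voxel_size_m: float) -> int:
    """Number of voxels needed to span one side of the scene."""
    return round(extent_m / voxel_size_m)


def dense_cell_count(resolution: int) -> int:
    """Number of cells in a dense cubic grid of the given resolution."""
    return resolution ** 3


side = cells_per_side(100.0, 0.10)   # 1000 voxels per 100 m side
ground_columns = side * side         # 1,000,000 columns over 100 m x 100 m
dense = dense_cell_count(1024)       # 1,073,741,824 cells in a dense 1024^3 grid

print(f"voxels per side:     {side}")
print(f"ground-plane cells:  {ground_columns:,}")
print(f"dense 1024^3 cells:  {dense:,}")
```

A 100m side at 10cm spacing needs 1000 voxels, which fits within a 1024-wide grid. A dense $1024^3$ grid would hold about 1.07 billion cells, compared with the millions of voxels the abstract mentions, which is why a sparse representation matters at this scale.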
