SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, Jiahui Huang

TL;DR

SCube tackles large-scale 3D scene reconstruction from sparse images by introducing VoxSplats, a voxel-grounded Gaussian splatting representation, together with a two-stage pipeline. The first stage learns a high-resolution geometry prior via image-conditioned latent diffusion over a sparse voxel grid; the second predicts per-voxel Gaussians for appearance, with a sky panorama for the background. The geometry stage uses XCube as a backbone and DINO-v2-based 3D conditioning to produce detailed, semantically labeled voxels, while the appearance stage uses a sparse UNet-based predictor whose voxel-splatted Gaussians render sharp views. Evaluated on Waymo data, SCube and its postprocessed variant SCube+ outperform state-of-the-art sparse-view 3D reconstruction methods in both geometry and appearance, and enable practical uses such as LiDAR simulation and text-to-scene generation. This work offers a fast, scalable pathway to high-quality large-scale 3D scenes by combining strong data priors with efficient, render-friendly representations.

Abstract

We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation, VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplat from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images, followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians with a 1024^3 voxel grid spanning hundreds of meters in 20 seconds. Past works tackling scene reconstruction from images either rely on per-scene optimization and fail to reconstruct the scene away from input views (thus requiring dense view coverage as input) or leverage geometric priors based on low-resolution models, which produce blurry results. In contrast, SCube leverages high-resolution sparse networks and produces sharp outputs from few views. We show the superiority of SCube over prior art on 3D reconstruction using the Waymo self-driving dataset and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.
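
To make the idea of "Gaussians supported on a sparse-voxel scaffold" concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the class names `Gaussian` and `VoxSplat`, the per-Gaussian fields, the offset parameterization, and the 0.5 m voxel size are all assumptions made for this example. It stores only occupied voxels and attaches a list of Gaussians to each, which is why a 1024^3 grid does not require storing every cell.

```python
from dataclasses import dataclass, field


@dataclass
class Gaussian:
    # Hypothetical per-Gaussian parameters; the paper's exact
    # parameterization is not given in this summary.
    offset: tuple   # center offset from the voxel center, in voxel units
    scale: tuple    # per-axis extent
    opacity: float
    color: tuple    # RGB in [0, 1]


@dataclass
class VoxSplat:
    voxel_size: float  # voxel edge length in meters (assumed value)
    # Sparse scaffold: only occupied voxels appear as keys (i, j, k).
    voxels: dict = field(default_factory=dict)

    def add(self, ijk, gaussian):
        """Attach a Gaussian to the voxel with integer index ijk."""
        self.voxels.setdefault(ijk, []).append(gaussian)

    def center(self, ijk, gaussian):
        """World-space center: voxel center plus the Gaussian's offset."""
        return tuple(
            (c + 0.5 + o) * self.voxel_size
            for c, o in zip(ijk, gaussian.offset)
        )

    def num_gaussians(self):
        return sum(len(gs) for gs in self.voxels.values())


# A dense 1024^3 grid would hold this many cells.
DENSE_CELLS = 1024 ** 3  # 1073741824

scene = VoxSplat(voxel_size=0.5)
g0 = Gaussian((0.0, 0.0, 0.0), (0.05, 0.05, 0.05), 0.9, (0.5, 0.5, 0.5))
g1 = Gaussian((0.25, -0.25, 0.0), (0.02, 0.02, 0.02), 0.7, (0.2, 0.3, 0.4))
g2 = Gaussian((0.0, 0.0, 0.25), (0.05, 0.05, 0.05), 0.8, (0.9, 0.9, 0.9))
scene.add((10, 20, 3), g0)
scene.add((10, 20, 3), g1)
scene.add((11, 20, 3), g2)

print(len(scene.voxels), "occupied voxels out of", DENSE_CELLS, "dense cells")
print(scene.num_gaussians(), "Gaussians; first center:", scene.center((10, 20, 3), g0))
```

Keying the scaffold by integer voxel index keeps memory proportional to the occupied surface rather than to the full volume; a real system would use a GPU sparse-grid structure instead of a Python dictionary.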

Paper Structure

This paper contains 22 sections, 7 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: SCube. Given sparse input images with little or no overlap, our model reconstructs a high-resolution and large-scale scene in 3D represented with VoxSplats, ready to be used for novel view synthesis or LiDAR simulation.
  • Figure 2: Framework. SCube consists of two stages: (1) We reconstruct a sparse voxel grid with semantic logits, conditioned on the input images, using a conditional latent diffusion model based on XCube [ren2023xcube]. (2) We predict the appearance of the scene, represented as VoxSplats and a sky panorama, using a feedforward network. Our method allows us to synthesize novel views quickly and accurately and supports many other applications.
  • Figure 3: Data Processing Pipeline. We add COLMAP [schonberger2016structure] dense reconstruction points to the accumulated LiDAR points and compensate for dynamic objects using their bounding boxes. This provides us with a more complete geometry for training.
  • Figure 4: Novel View Synthesis. We show the synthesized novel views of SCube+ compared to baseline approaches. The inset of each subfigure shows a top-down visualization (an extreme novel view) of the reconstructed scene geometry.
  • Figure 5: Geometry Reconstruction from Sparse Views. We show the comparison between our method and Metric3Dv2 [hu2024metric3dv2]. The semantics of Metric3Dv2 are obtained from Segformer [xie2021segformer].
  • ...and 10 more figures