Bolt3D: Generating 3D Scenes in Seconds

Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler

TL;DR

Bolt3D tackles fast, scalable 3D scene generation by reframing 3D creation as a conditional diffusion problem over a 3D Gaussian splat representation. It introduces a Geometry VAE and a multi-view latent diffusion model to generate per-view geometry and appearance, followed by a Gaussian head that predicts opacities and covariances, yielding renderable 3D scenes without per-scene optimization. Training uses a large-scale dataset of dense multi-view pointmaps derived with MASt3R to supervise the geometry and rendering losses, improving realism in unobserved regions. Generation takes under seven seconds on a single GPU, and inference cost drops by up to 300x relative to optimization-based methods, enabling scalable 3D content creation with realistic, view-consistent results.

Abstract

We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures.

Figures (3)

  • Figure 1: Given an arbitrary number of input images, Bolt3D directly outputs a 3D representation which can be rendered at interactive frame rates. Operating in a feed-forward manner, generation takes mere seconds. Bolt3D features a latent diffusion model with a scalable 2D architecture, trained on large-scale appearance and geometry data, enabling generation of full 360° scenes from one or multiple input images. We invite the reader to explore these scenes in the interactive viewer available on the project website.
  • Figure 2: Method. Bolt3D takes as input one or more posed, observed images, and a set of target poses (a), and outputs a renderable 3D scene (e). First, we use a multi-view latent diffusion model to sample per-view latent appearance and geometry (b). The appearance and geometry latents are independently decoded to full-resolution images and pointmaps (c) using a pre-trained image VAE decoder and our trained geometry decoder, respectively. Next, a multi-view Gaussian head predicts the opacities and scales of pixel-aligned 3D Gaussians, and refines the predicted colors. Together with the pointmap from (c), these values form Splatter Images [Szymanowicz et al., 2024] (d), which can be combined to create a complete 3D Gaussian representation of the scene (e).
  • Figure 3: Qualitative results. We show renders of our 3D scenes reconstructed from just one input image (top left corner in each image) in a feed-forward manner. Inference takes only 7 seconds on a single GPU.
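
As a rough illustration of steps (d) and (e) in the Figure 2 caption, the sketch below treats each Splatter Image as a per-view grid with one Gaussian per pixel, and builds the scene by concatenating those grids. This is only a minimal sketch under stated assumptions: the names `SplatterImage`, `GaussianScene`, and `merge_splatter_images` are hypothetical and do not come from the paper, plain concatenation is an assumed reading of "combined", and Gaussian rotations and the networks that produce the per-pixel values are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SplatterImage:
    """Hypothetical per-view container: one Gaussian per pixel, row-major."""
    height: int
    width: int
    means: List[Vec3]        # 3D position per pixel (the decoded pointmap)
    colors: List[Vec3]       # refined RGB per pixel
    opacities: List[float]   # opacity per pixel
    scales: List[Vec3]       # per-axis scale per pixel

    def __post_init__(self) -> None:
        # Pixel alignment means every attribute has exactly H * W entries.
        expected = self.height * self.width
        for name in ("means", "colors", "opacities", "scales"):
            actual = len(getattr(self, name))
            if actual != expected:
                raise ValueError(
                    f"{name} has {actual} entries, expected {expected}"
                )


@dataclass
class GaussianScene:
    """Hypothetical flat set of Gaussians gathered from all views."""
    means: List[Vec3] = field(default_factory=list)
    colors: List[Vec3] = field(default_factory=list)
    opacities: List[float] = field(default_factory=list)
    scales: List[Vec3] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.means)


def merge_splatter_images(views: List[SplatterImage]) -> GaussianScene:
    """Concatenate per-view Gaussians into one scene (assumed combination rule).

    The pointmap positions are taken to be in a shared world frame already,
    so no per-view transform is applied here.
    """
    scene = GaussianScene()
    for view in views:
        scene.means.extend(view.means)
        scene.colors.extend(view.colors)
        scene.opacities.extend(view.opacities)
        scene.scales.extend(view.scales)
    return scene
```

Under this reading, a scene built from N views at resolution H x W holds N * H * W Gaussians, which is why the per-pixel predictions can be turned into a renderable scene without any per-scene optimization step.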