Bolt3D: Generating 3D Scenes in Seconds
Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
TL;DR
Bolt3D makes 3D scene generation fast and scalable by casting it as conditional latent diffusion whose output is a 3D Gaussian splat representation. A Geometry VAE and a multi-view latent diffusion model generate per-view geometry and appearance, and a Gaussian head then predicts opacities and covariances, yielding a renderable 3D scene without per-scene optimization. The model is trained on a large-scale dataset of dense multi-view pointmaps derived with MASt3R, which provides supervision for the geometry and rendering losses and improves realism in unobserved regions. Generation takes less than seven seconds on a single GPU, and inference cost is up to 300x lower than that of prior multiview generative models that rely on per-scene optimization.
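To make the final assembly step concrete, here is a minimal sketch of how per-view outputs could be combined into a set of renderable Gaussians. It is an illustration under stated assumptions, not the authors' implementation: it assumes one Gaussian per pixel, with the mean taken from the generated pointmap and the color from the generated image, and it assumes the Gaussian head outputs an opacity, a per-axis scale, and a rotation quaternion, from which the covariance is built as R diag(s^2) R^T (the usual 3D Gaussian splatting parameterization). The names `Gaussian`, `quat_to_rotation`, `covariance`, and `assemble_gaussians` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class Gaussian:
    mean: Vec3        # 3D position, assumed to come from the pointmap pixel
    color: Vec3       # RGB, assumed to come from the image pixel
    opacity: float    # assumed output of the Gaussian head
    covariance: Mat3  # 3x3 symmetric matrix built from scale and rotation


def quat_to_rotation(q: Sequence[float]) -> Mat3:
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = (w * w + x * x + y * y + z * z) ** 0.5  # normalize to unit length
    w, x, y, z = w / n, x / n, y / n, z / n
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )


def covariance(scale: Sequence[float], quat: Sequence[float]) -> Mat3:
    """Build Sigma = R diag(scale^2) R^T (assumed parameterization)."""
    R = quat_to_rotation(quat)
    return tuple(
        tuple(sum(R[i][k] * scale[k] ** 2 * R[j][k] for k in range(3))
              for j in range(3))
        for i in range(3)
    )


def assemble_gaussians(pointmaps, images, opacities, scales, quats) -> List[Gaussian]:
    """Flatten per-view, per-pixel predictions into one list of Gaussians.

    Each argument is indexed as [view][pixel]; pixels are stored flat.
    """
    gaussians = []
    for v in range(len(pointmaps)):
        for p in range(len(pointmaps[v])):
            gaussians.append(Gaussian(
                mean=tuple(pointmaps[v][p]),
                color=tuple(images[v][p]),
                opacity=opacities[v][p],
                covariance=covariance(scales[v][p], quats[v][p]),
            ))
    return gaussians


# Toy input: 2 views with 2 pixels each (all values are made up).
identity = (1.0, 0.0, 0.0, 0.0)
scene = assemble_gaussians(
    pointmaps=[[(0, 0, 1), (1, 0, 1)], [(0, 1, 2), (1, 1, 2)]],
    images=[[(1, 0, 0), (0, 1, 0)], [(0, 0, 1), (1, 1, 1)]],
    opacities=[[0.9, 0.8], [0.7, 0.6]],
    scales=[[(1, 2, 3)] * 2, [(1, 1, 1)] * 2],
    quats=[[identity] * 2, [identity] * 2],
)
```

Because every quantity is predicted per pixel, the scene is obtained by a single pass over the network outputs, which is consistent with the absence of per-scene optimization described above.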
Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
