Fillerbuster: Multi-View Scene Completion for Casual Captures
Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt
TL;DR
This work addresses the completion of missing content in casually captured 3D scenes whose camera poses may be unknown. It introduces Fillerbuster, a large-scale latent diffusion transformer that jointly models images and 6-channel raymaps to complete unseen regions and estimate poses across many input views. The approach combines a two-branch VAE + DiT architecture, variable sequence conditioning via index embeddings, a flow-matching objective, and classifier-free guidance. It demonstrates uncalibrated scene completion and superior multi-view inpainting results on benchmark datasets. In practice, Fillerbuster can generate dozens of coherent novel views from casual captures, which improves 3D reconstruction workflows and enables more immersive scene experiences. Handling large viewpoint gaps and training on more diverse data remain directions for future work.
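To make the "6-channel raymap" mentioned above more concrete, here is a minimal NumPy sketch of one common way to build such a map. It assumes the six channels are a per-pixel ray origin (3) and unit ray direction (3) under a pinhole camera; the summary does not spell out the exact parameterization, so the function name `make_raymap`, the argument layout, and the channel ordering are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def make_raymap(K, cam_to_world, height, width):
    """Return an (H, W, 6) array: ray origin (3) + unit ray direction (3) per pixel.

    Assumption: pinhole intrinsics K (3x3) and a 4x4 camera-to-world pose.
    """
    # Pixel centres in image coordinates; u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics to camera-space directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into world space and normalise to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every pixel of one camera shares the same origin (the camera centre).
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

# Hypothetical camera: 100 px focal length, principal point at the image centre.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
raymap = make_raymap(K, pose, height=48, width=64)
```

Because a raymap has the same spatial layout as an image, it can be encoded and denoised alongside the image, which is what lets a single model treat pose estimation as predicting a missing raymap.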
Abstract
We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are ill-suited to this challenge because they focus either on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete the areas that those frames never observe. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model both predicts poses and creates new content. Our model is the first to predict many images and poses together for scene completion.
