Fillerbuster: Multi-View Scene Completion for Casual Captures

Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt

TL;DR

This work addresses completing missing content in casually captured 3D scenes where camera poses may be unknown. It introduces Fillerbuster, a large-scale latent diffusion transformer that jointly models images and 6-channel raymaps to complete unseen regions and estimate poses across many input views. The approach relies on a two-branch VAE + DiT architecture with variable sequence conditioning via index embeddings, a flow-matching objective, and classifier-free guidance, achieving uncalibrated scene completion and superior multi-view inpainting on benchmark datasets. Practically, Fillerbuster enables generation of dozens of coherent novel views from casual captures, improving 3D reconstruction workflows and enabling more immersive scene experiences while highlighting directions for handling large viewpoint gaps and more diverse training data.
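
To make the sampling side of this summary concrete, here is a minimal sketch of how classifier-free guidance can be combined with Euler integration of a flow-matching velocity field. It is not the authors' implementation. The function names, the time convention (t = 0 is noise, t = 1 is data), and the toy velocity function standing in for the DiT are all assumptions made for illustration; the step count of 24 is taken from the Figure 4 caption.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by `scale` (scale=1 is purely conditional)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps.
    Assumed convention: t=0 is noise, t=1 is data."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the learned model (hypothetical): the "conditional" branch
# points straight at a known target, the "unconditional" branch predicts zero.
target = [1.0, -2.0]

def toy_velocity(x, t):
    v_cond = [(a - b) / (1.0 - t) for a, b in zip(target, x)]
    v_uncond = [0.0 for _ in x]
    return guided_velocity(v_cond, v_uncond, 1.0)

sample = euler_sample([0.3, 0.7], toy_velocity, 24)
```

In a real system the two velocity predictions would come from the same network evaluated with and without the conditioning views, and the guidance scale would be a tuned hyperparameter.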

Abstract

We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for handling this challenge as they focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. Our model is the first to predict many images and poses together for scene completion.

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Completing casual captures. Fillerbuster takes an incomplete casual capture which has many images (left) and conditions on these to create many consistent novel views, shown on the right with arrows. The original images and the new ones enable novel-view synthesis (right) that is much more complete compared to vanilla Gaussian Splatting trained on only the incomplete casual capture (left).
  • Figure 2: Problem setting. We illustrate our problem setting with respect to a non-exhaustive set of related work. Many works focus on scene synthesis (left), where one generates data from text or from a single image. Similarly, many tackle novel-view synthesis (bottom) to synthesize new views of the input image content. Fewer works focus on scene completion (top right), where the task is to complete missing content in captures.
  • Figure 3: Model overview. Fillerbuster is trained on a large collection of multi-view images and poses (top and bottom of stacked images, respectively), which makes it useful for completing casual captures at inference time. More specifically, we are interested in four primary uses of the model: (1) conditioning on known images that have poses, (2) predicting new views where poses are provided, (3) predicting partial images where some pixels are known, and (4) recovering the camera poses when they are unknown. Our model is a latent DiT trained to jointly model images and poses for any mixture of the input. In practice, our poses are 6-channel raymaps encoding ray origins and directions.
  • Figure 4: Model samples. Here we show generations from our model. For this setting, we provide pose input for all images. The top rows indicate which pixels are known, with yellow marking unknown regions. The middle rows show the inpainted images after passing the entire sequence of size 16 (top rows) into the model for 24 denoising steps. The bottom rows show the ground truth (GT), but note that this is not necessarily the only correct solution if the newly generated pixels are unobserved according to the masks. Notice that in the top example, the generations are self-consistent but different from the GT, which is entirely plausible.
  • Figure 5: Completing casual captures. Here we demonstrate our ability to complete casual captures from the training splits of the Nerfbusters dataset (Warburg et al., 2023). On the left, we show the input captures and some representative images. 3DGS (Splatfacto) cannot add missing details, so the capture remains incomplete. Our CAT3D baseline conditions on 3 images and generates 6 images at a time, so it cannot produce consistent content. Fillerbuster conditions on 16–40 images to generate 24 novel views, and obtains the most consistent results.
  • ...and 8 more figures
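
The Figure 3 caption describes poses as 6-channel raymaps encoding ray origins and directions. As a rough illustration of what such a raymap contains, the sketch below builds one for a pinhole camera. The camera convention (looking along +z, principal point at the image center), the unit-length directions, and the name `make_raymap` are assumptions for this example and may differ from the paper's exact parameterization.

```python
import math


def make_raymap(height, width, focal, cam_to_world):
    """Return a height x width grid where each pixel holds
    [ox, oy, oz, dx, dy, dz]: the ray origin and unit ray direction in world space.

    cam_to_world is a 3x4 matrix [R | t] given as nested lists.
    Assumed convention: camera looks along +z, principal point at the image center.
    """
    cx, cy = width / 2.0, height / 2.0
    # Every ray of a pinhole camera starts at the camera center (the translation).
    origin = [cam_to_world[r][3] for r in range(3)]
    raymap = []
    for v in range(height):
        row = []
        for u in range(width):
            # Direction through the pixel center, in camera coordinates.
            d_cam = [(u + 0.5 - cx) / focal, (v + 0.5 - cy) / focal, 1.0]
            # Rotate the direction into world coordinates.
            d_world = [
                sum(cam_to_world[r][c] * d_cam[c] for c in range(3))
                for r in range(3)
            ]
            norm = math.sqrt(sum(x * x for x in d_world))
            row.append(origin + [x / norm for x in d_world])
        raymap.append(row)
    return raymap


# Usage: identity rotation, camera center at (1, 2, 3), 3x3 image.
pose = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 2.0],
        [0.0, 0.0, 1.0, 3.0]]
raymap = make_raymap(3, 3, 2.0, pose)
```

Because the raymap has the same spatial layout as an image, it can be masked, noised, and denoised alongside the image, which is what makes joint image and pose modeling possible in a single sequence model.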