MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan

TL;DR

This work carefully aligns independent monocular reconstructions from each camera to produce time- and view-consistent dynamic scene reconstructions, achieving higher quality than prior art, particularly when rendering novel views.

Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in the wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on Panoptic Studio and Ego-Exo4D demonstrate that our method achieves higher-quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available at https://github.com/Z1hanW/MonoFusion.

Paper Structure

This paper contains 30 sections, 6 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Dynamic Scene Reconstruction from Sparse Views. MonoFusion reconstructs dynamic human behaviors, such as playing the piano or performing CPR, from four equidistant inward-facing static cameras. We visualize the RGB and depth renderings of a 45$^\circ$ novel view between two training views. Training views are shown below for reference.
  • Figure 2: Problem Setup. Our sparse-view setup (middle) strikes a balance between ill-posed reconstructions from casual monocular captures gao2022dynamicdavis and well-constrained reconstructions from dense multi-view studio captures joo2017panoptic. Unlike existing "sparse-view" datasets like DTU jensen2014large and LLFF mildenhall2019llff, our setup is more challenging because input views are 90$^\circ$ apart with limited cross-view correspondences.
  • Figure 3: Approach. Given sparse-view video sequences of a scene (left), we aim to optimize a 3D Gaussian representation over time. We begin by running DUSt3R wang2024dust3r, a static multi-view reconstruction method, on the sparse views of a given reference timestamp. This generates a global reference frame that connects all views. Next, we use MoGe wang2024moge to independently predict depth maps for each camera. Since these depth predictions are only defined up to an affine transformation, we must estimate a scale and shift for each predicted depth map across all views and time instants. To achieve this, we leverage the fact that background pixels remain static over time. Specifically, for each time instant and each view, we align the background regions of each camera's depth map to the global reference frame by adjusting the scale and shift parameters accordingly (middle, top). This process requires a foreground-background mask for all input videos, which can be obtained using off-the-shelf tools like SAM 2 ravi2024sam. To reduce occlusions and noisy depth predictions, we concatenate all aligned background depth points and average corresponding background points across time, where correspondence across time is trivially given by the 2D pixel index of the unprojected pointmap. Lastly, we find that motion bases constructed from feature clustering form a more geometrically consistent set of bases (middle, bottom) than those initialized by noisy 3D tracks wang2024som. Our optimization yields a 4D scene representation from which we can rasterize RGB frames, depth maps, a foreground silhouette, and object features from novel views (right).
  • Figure 4: Qualitative analysis of held-out view synthesis on ExoRecon. We show qualitative results of held-out view synthesis (left) and a 5$^\circ$ deviation from the static camera position at the held-out timestamp (right). Compared to other multi-view baselines, our method performs dramatically better at interpolating the motion of the dynamic foreground (left), even from new camera views (right). We posit that Dynamic 3DGS suffers from a lack of geometric constraints and that MV-SOM has duplicate foreground artifacts because of conflicting depth initialization from the four views.
  • Figure 5: Qualitative results of $45^\circ$ novel-view synthesis on Panoptic Studio. We show qualitative novel-view synthesis results of our method compared to baselines on the softball (left) and tennis (right) sequences. We visualize the ground-truth RGB image for the $45^\circ$ novel view at the top. Our rendered extreme novel-view RGB image closely matches the ground truth. We find that all other baselines struggle to generalize to extreme novel views.
  • ...and 8 more figures
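
The depth-alignment step described in the Figure 3 caption (a per-view, per-time scale and shift fitted on static background pixels, followed by averaging background points across time) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_scale_shift` and `average_background` are hypothetical, the closed-form least-squares fit is an assumed solver that the caption does not specify, and the sketch averages per-pixel depth rather than unprojected 3D points, which is equivalent only under the assumption of a static camera with fixed pixel rays.

```python
import numpy as np


def fit_scale_shift(pred_depth, ref_depth, bg_mask):
    """Least-squares (s, t) so that s * pred_depth + t ~= ref_depth on background pixels.

    pred_depth: (H, W) monocular depth, defined only up to scale and shift.
    ref_depth:  (H, W) depth of the same view in the global reference frame.
    bg_mask:    (H, W) bool, True where the pixel is static background.
    """
    x = pred_depth[bg_mask].astype(np.float64)
    y = ref_depth[bg_mask].astype(np.float64)
    # Solve [x, 1] @ [s, t]^T = y in the least-squares sense.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)


def average_background(aligned_depths, bg_masks):
    """Per-pixel mean of aligned depth over time, using only background frames.

    aligned_depths: (T, H, W) depths already mapped into the reference frame.
    bg_masks:       (T, H, W) bool background masks.
    Pixels that are never background are returned as NaN.
    """
    d = np.asarray(aligned_depths, dtype=np.float64)
    m = np.asarray(bg_masks, dtype=bool)
    count = m.sum(axis=0)
    total = np.where(m, d, 0.0).sum(axis=0)
    out = np.full(count.shape, np.nan)
    return np.divide(total, count, out=out, where=count > 0)
```

In this sketch, foreground pixels never influence the fit, so a moving person does not bias the recovered scale and shift; a robust loss could be swapped in if the predicted background depth is noisy.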