MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, Zhicheng Yan

TL;DR

The paper tackles the inefficiency and error propagation of two-view pose-free reconstruction by introducing MV-DUSt3R, a single-stage network that fuses information from many views in one forward pass to produce per-view 3D pointmaps without camera intrinsics or poses. MV-DUSt3R+ builds on this with cross-reference-view attention to robustly integrate information across multiple reference views, further improving large-scale scene reconstructions. Both models can be extended for novel view synthesis through Gaussian splatting heads, trained jointly with reconstruction and rendering losses. Across HM3D, ScanNet, and MP3D, the authors demonstrate substantial speedups and accuracy gains over prior art, with up to 24-view inputs and sub-2-second reconstructions, making pose-free, large-scale multi-view reconstruction practical for real-time or near-real-time applications.

Abstract

Recent sparse multi-view scene reconstruction advances like DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. Code will be released.
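
To make the "combinatorial number" of pairwise reconstructions concrete, the sketch below counts how many two-view inferences a pairwise pipeline would need for N input views. It assumes every pair of views is evaluated (a complete pair graph); real pipelines may prune pairs, so treat the numbers as an upper bound on that assumption. The helper name `pairwise_runs` is hypothetical and does not come from the paper.

```python
from math import comb


def pairwise_runs(num_views: int, ordered: bool = False) -> int:
    """Count two-view inferences needed to cover every pair of views.

    Assumption (not from the paper): the pairwise pipeline evaluates
    all pairs. With ordered=True, (i, j) and (j, i) count separately.
    """
    unordered = comb(num_views, 2)  # N * (N - 1) / 2
    return 2 * unordered if ordered else unordered


if __name__ == "__main__":
    for n in (4, 8, 12, 24):
        pairs = pairwise_runs(n)
        print(f"{n:>2} views: {pairs:>3} unordered pairs vs. 1 single-stage forward pass")
```

Under this assumption, 24 input views yield 276 unordered pairs (552 ordered), each followed by a global alignment step, whereas a single-stage network processes all views in one forward pass.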

Paper Structure

This paper contains 25 sections, 13 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: The proposed Multi-View Dense Unconstrained Stereo 3D Reconstruction Prime (MV-DUSt3R+) can reconstruct large scenes from multiple pose-free RGB views. Top row: one single-room scene and one large multi-room scene reconstructed by MV-DUSt3R+ in 0.89 and 1.54 seconds using 12 and 20 input views, respectively (only a subset is shown for visualization). Bottom row: MV-DUSt3R+ can synthesize novel views by predicting pixel-aligned Gaussian parameters. Reconstruction of such large scenes is challenging for prior methods (e.g., DUSt3R). See the reconstruction results comparison figure and the appendix for more results with comparisons.
  • Figure 2: Left: Ground-truth scene with 8 views: three chairs surrounding one table and one more chair next to another table. Right: reconstruction and pose estimation of DUSt3R with global optimization: all chairs incorrectly surround one table. Wrong poses are marked in red.
  • Figure 3: Overview of MV-DUSt3R. Visual tokens for the reference view and other source views are shown in blue and green, respectively. Black straight solid lines indicate the primary token flow while gray lines indicate secondary token flow.
  • Figure 4: Top: A multi-room scene: 16 views are sampled as input to MV-DUSt3R. For clarity, only 6 are shown. 3 of them are reference view candidates, highlighted in blue. Bottom: In each row, we select a different reference view and render the reconstructed scene from 6 input views. Renderings of good and poor quality are highlighted in green and red, respectively. As the viewpoint change between the input view and the reference view increases, the quality of the reconstructed scene geometry in that input view decreases.
  • Figure 5: DecBlock and CrossRefViewBlock in MV-DUSt3R+: tokens of the reference and other views are highlighted in blue and green, respectively. Each model path uses a different reference view. For clarity, only one of the stacked DecBlocks and one of the stacked CrossRefViewBlocks are shown.
  • ...and 7 more figures