MUSt3R: Multi-view Network for Stereo 3D Reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, Vincent Leroy

TL;DR

MUSt3R extends DUSt3R by enabling $N$-view regression for dense 3D reconstruction without camera calibration. It introduces a symmetric, memory-backed multi-view architecture that directly predicts per-view pointmaps ${\bf X}_{i,1}$ and ${\bf X}_{i,i}$ in a common frame, and uses iterative memory with 3D feedback to support online VO/SLAM. The framework achieves state-of-the-art performance across uncalibrated Visual Odometry, relative pose estimation, 3D reconstruction, and multi-view depth on diverse benchmarks, while reducing computational complexity relative to quadratic pairwise approaches. Ablations demonstrate the benefits of a single shared decoder, memory augmentation, log-space regression, and 3D feedback for scalability and accuracy. Overall, MUSt3R provides a unified, fast, and robust platform for offline SfM and online VO/SLAM in uncalibrated, heterogeneous sensor settings.

Abstract

DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration or viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs grows quadratically with the number of images, an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views that addresses these concerns. First, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. Second, we endow the model with a multi-layer memory mechanism that reduces the computational complexity and scales the reconstruction to large collections, inferring thousands of 3D pointmaps at high frame rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and visual SLAM scenarios, showing state-of-the-art performance on various 3D downstream tasks, including uncalibrated Visual Odometry, relative camera pose, scale and focal estimation, 3D reconstruction, and multi-view depth estimation.
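
To make the quadratic growth mentioned in the abstract concrete, the following minimal sketch counts the unordered image pairs in a collection of $N$ views, assuming every pair is considered (a complete graph). The function name `num_pairs` is hypothetical and used only for illustration; a pairwise pipeline may in practice use ordered pairs or a sparsified pair graph, which changes the constant but not the trend.

```python
def num_pairs(n_views: int) -> int:
    """Number of unordered image pairs among n_views images (complete graph)."""
    return n_views * (n_views - 1) // 2


if __name__ == "__main__":
    # Compare the pair count against the number of views as the collection grows.
    for n in (10, 100, 1000):
        print(f"{n} views -> {num_pairs(n)} pairs")
```

For 1000 images this already gives 499500 pairs, whereas a multi-view formulation predicts one set of pointmaps per view.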

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 2: Qualitative examples of MUSt3R reconstructions: Aachen Day-Night nexus4 sequence 5 (offline, top) and TUM-RGBD Freiburg1-room sequence (online, bottom). More qualitative examples can be found in the supplementary material.
  • Figure 3: (Left) Overview of our uncalibrated reconstruction framework: an input RGB image, the MUSt3R architecture, and the memory state. The network predicts both local ${\bf X}_{i,i}$ and global ${\bf X}_{i,1}$ pointmaps, from which the camera focal length, depth map, pose, and dense 3D can be efficiently recovered, as seen in the global reconstruction. The memory is optionally updated according to simple heuristics depending on the scenario. (Right) Qualitative example of uncalibrated Visual Odometry on the ETH3D "boxes" sequence in the online setting.
  • Figure 4: Overview of the proposed architecture for a decoder of depth $L=3$, a Linear $\textsc{Head}^{\text{3D}}$ and without the $\textsc{Inj}^{\text{3D}}$ module. The left side shows initialization with two images. The right side shows how the memory is used and updated given a new image/frame.
  • Figure 5: The 3D feedback module for a decoder of depth $L=3$.
  • Figure 6: Qualitative example of MUSt3R reconstructions of Cambridge Landmarks.
  • ...and 4 more figures
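
The Figure 3 caption states that the depth map and camera focal length can be recovered from the local pointmap ${\bf X}_{i,i}$. The sketch below illustrates one simple way this could work under a pinhole model; it is not the procedure described by the authors, which this section does not specify. It assumes a principal point at the image center, square pixels, and strictly positive depth, and it uses a plain least-squares fit (a robust estimator may be preferable in practice). The function name `depth_and_focal_from_local_pointmap` is hypothetical.

```python
import numpy as np


def depth_and_focal_from_local_pointmap(X_local: np.ndarray) -> tuple[np.ndarray, float]:
    """Recover a depth map and a focal length (in pixels) from a local pointmap.

    X_local has shape (H, W, 3) and holds, for each pixel, a 3D point expressed
    in that camera's own frame (the role played by X_{i,i} in the text).
    Assumptions (not from the source): pinhole camera, principal point at the
    image center, square pixels, positive depth everywhere.
    """
    H, W, _ = X_local.shape

    # In the camera's own frame, depth is simply the z-coordinate.
    depth = X_local[..., 2]

    # Pixel coordinates measured from the assumed principal point.
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    u -= (W - 1) / 2.0
    v -= (H - 1) / 2.0

    # Pinhole projection: u = f * x / z and v = f * y / z.
    a = X_local[..., 0] / depth
    b = X_local[..., 1] / depth

    # Closed-form least-squares solution for the single unknown f.
    f = (u * a + v * b).sum() / (a * a + b * b).sum()
    return depth, float(f)


if __name__ == "__main__":
    # Synthetic check: build a pointmap from a known focal length and depth.
    H, W, f_true = 4, 6, 500.0
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    u -= (W - 1) / 2.0
    v -= (H - 1) / 2.0
    z = 2.0 + 0.1 * u + 0.05 * v
    X = np.stack([u * z / f_true, v * z / f_true, z], axis=-1)
    depth, f_est = depth_and_focal_from_local_pointmap(X)
    print(f_est)
```

On noise-free synthetic data the fit returns the focal length used to generate the pointmap; with predicted pointmaps the estimate is only as good as the assumptions listed above.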