S-MUSt3R: Sliding Multi-view 3D Reconstruction

Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, Jerome Revaud

TL;DR

S-MUSt3R tackles scalable monocular 3D reconstruction from long, uncalibrated RGB sequences by sliding MUSt3R across overlapping segments, then aligning and stitching the local reconstructions with a lightweight loop-closure pose graph. It introduces confidence-depth fusion, a choice among transform groups ($SIM(3)$, $Affine(3)$, $SL(4)$), and an efficient loop-detection mechanism that yields a compact optimization problem. The approach preserves metric-space predictions without retraining and achieves competitive results on the TUM and 7-Scenes benchmarks, often outperforming non-metric, long-sequence baselines like VGGT-Long. The study demonstrates that a simple segment-based strategy with robust alignment and loop closure can match or exceed the performance of more complex backends while remaining practical for real-world robotic applications.
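The sliding step described above can be illustrated with a minimal sketch that splits a frame sequence into overlapping index ranges. The function name `sliding_segments` and the parameters `segment_len` and `overlap` are hypothetical and chosen only for illustration; the source does not specify the segment length or overlap the authors use, and this is not their implementation.

```python
def sliding_segments(num_frames, segment_len, overlap):
    """Split frame indices [0, num_frames) into overlapping half-open ranges.

    Consecutive segments share `overlap` frames, which gives a later
    alignment step common observations to work with.
    All names and parameters here are illustrative assumptions.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    if num_frames <= 0:
        return []

    stride = segment_len - overlap  # how far the window advances each step
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)  # clip the last segment
        segments.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return segments


if __name__ == "__main__":
    # 10 frames, windows of 4 frames, 1 shared frame between neighbours
    print(sliding_segments(10, 4, 1))
```

Each range would then be passed to the reconstruction model independently, so peak memory depends on the segment length rather than on the full sequence length.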

Abstract

The recent paradigm shift in 3D vision has led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale 3D reconstruction from RGB streams remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a strategy of sequence segmentation followed by segment alignment and lightweight loop-closure optimization. Without model retraining, we benefit from the remarkable 3D reconstruction capacities of the MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architectures. We evaluate S-MUSt3R on TUM, 7-Scenes, and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene reconstruction in real-world settings, with the important advantage of making predictions directly in metric space.
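To make the segment-alignment step concrete, here is a minimal sketch of estimating a similarity transform (scale, rotation, translation) between two sets of corresponding 3D points, such as points predicted for the frames shared by two neighbouring segments. It uses the standard closed-form Umeyama solution, which is an assumption made for illustration: the source does not state which solver the authors use, and their pipeline also involves confidences and other transform groups that this unweighted sketch ignores. The function name `umeyama_sim3` is hypothetical.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares similarity transform (Umeyama, 1991).

    Finds scale s, rotation R (3x3) and translation t (3,) minimising
    sum_i || dst_i - (s * R @ src_i + t) ||^2 for corresponding points.
    Illustrative sketch only; not the authors' implementation.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centred point sets
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: force a proper rotation with det(R) = +1
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def apply_sim3(s, R, t, points):
    """Map an (N, 3) array of points with the similarity transform."""
    return s * np.asarray(points, dtype=float) @ R.T + t
```

In a sliding pipeline, `src` would hold the overlap points expressed in the new segment's frame and `dst` the same points in the already-aligned frame, so the estimated transform carries the whole new segment into the common frame.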

Paper Structure (15 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: S-MUSt3R partitions the long image sequence into segments, applies the MUSt3R model to each segment, then aligns the segments and corrects the accumulated drift with lightweight loop closure and optimization.
  • Figure 2: Example of a pose graph where segments are nodes (nodes in blue for sequence segments and an extra node in green for loop closure) and edges are constrained by pose-driven and pointmap-driven alignments.
  • Figure 3: 3D reconstruction with S-MUSt3R for the room and floor sequences of the TUM dataset and the office and chess sequences of the 7-Scenes dataset. Top: 3D reconstruction with original RGB colors; Bottom: segment pointmaps shown in different colors.
  • Figure 4: Segment pose correction with lightweight loop closure, for the TUM (top) and 7-Scenes (bottom) datasets.
  • Figure 5: 3D reconstruction from a robot navigation sequence with S-MUSt3R in RGB (left) and segment-colored (middle) versions and with VGGT-Long (right).