SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, Baoquan Chen

TL;DR

SLAM3R tackles real-time dense 3D reconstruction from monocular RGB video by learning end-to-end mappings from multi-view frames to dense 3D pointmaps, avoiding camera parameter estimation. It introduces a two-stage architecture: Image-to-Points (I2P) for local geometry within sliding windows and Local-to-World (L2W) for global registration using retrieved scene frames. The method achieves state-of-the-art reconstruction accuracy and completeness while running at or above 20 FPS on RGB input, and it generalizes well across diverse datasets. This work reduces reliance on depth sensors and offline optimization, enabling practical RGB-only dense scene reconstruction in real time.

Abstract

In this paper, we introduce SLAM3R, a novel and effective system for real-time, high-quality, dense 3D reconstruction using RGB videos. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given an input video, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images in each window and progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction, all without explicitly solving for any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Code is available at: https://github.com/PKU-VCL-3DV/SLAM3R.
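
The sliding window step in the abstract can be illustrated with a minimal sketch. It assumes a fixed window length and stride; the function name `sliding_windows` and the parameters `window_size` and `stride` are hypothetical and are not taken from the SLAM3R code, and the paper's actual clip construction may differ.

```python
from typing import List


def sliding_windows(num_frames: int, window_size: int, stride: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into overlapping clips.

    Consecutive clips overlap whenever stride < window_size. The last clip is
    shifted back so that it still holds window_size frames (an assumption of
    this sketch, not a detail stated in the source).
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if num_frames <= window_size:
        # Short videos yield a single clip (or nothing for an empty video).
        return [list(range(num_frames))] if num_frames > 0 else []

    starts = list(range(0, num_frames - window_size + 1, stride))
    # Make sure the final frames are covered by a full-length clip.
    if starts[-1] + window_size < num_frames:
        starts.append(num_frames - window_size)
    return [list(range(s, s + window_size)) for s in starts]


if __name__ == "__main__":
    for clip in sliding_windows(num_frames=10, window_size=5, stride=2):
        print(clip)
```

Each returned clip would then be passed to a local reconstruction step; a stride smaller than the window length is what makes neighbouring clips share frames.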

Paper Structure

This paper contains 27 sections, 17 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: We introduce a novel dense reconstruction system, SLAM3R. It takes a monocular RGB video as input and reconstructs the scene as a dense point cloud. The video is converted into short clips for local reconstruction (denoted as inner-window), which are then incrementally registered together (inter-window) to create a global scene model. This process runs in real time, producing a reconstruction that is both accurate and complete.
  • Figure 2: System overview. Given an input monocular RGB video, we apply a sliding window mechanism to convert it into overlapping clips (referred to as windows). Each window is fed into an Image-to-Points (I2P) network to recover 3D points in a local coordinate system. Next, the local points are incrementally fed into a Local-to-World (L2W) network to create a globally consistent scene model. The proposed I2P and L2W networks elegantly share similar architectures. In the I2P step (Sec. \ref{sec:inner}), we select a keyframe as a reference to set up a local coordinate system and use the remaining frames in the window to estimate the 3D geometry captured within it. The points from the first window are used to establish the world coordinate system. We then incrementally fuse the following windows in the L2W step (Sec. \ref{sec:inter}). This process involves retrieving the most relevant already-registered keyframes as a reference, and integrating new keyframes. Through this iterative process, we eventually obtain the full scene reconstruction.
  • Figure 3: Illustration of a decoder block in the proposed keyframe decoder $D_{key}$. We present a minimalist modification to integrate information from different supporting images. Our approach traverses each of them, selects its token keys and values, and uses the keyframe queries to interact with them separately across the supporting images. This multi-view information is then aggregated through max-pooling. The registration decoder $D_{reg}$ and scene decoder $D_{sce}$ (described in Sec. \ref{sec:inter}) share the same architecture.
  • Figure 4: We visualize the reconstruction results on two scenes: Office-09 and Office 2 from the 7-Scenes [shotton2013scene] and Replica [replica19arxiv] datasets, respectively. Our method runs in real time and achieves high-quality reconstruction comparable to the offline method DUSt3R [wang2024dust3r].
  • Figure 5: Qualitative examples. We show our reconstruction results on Tanks and Temples [knapitsch2017tanks], BlendedMVS [yao2020blendedmvs], Map-free Reloc [arnold2022map], LLFF [mildenhall2019local], and ETH3D [schops2019bad, schops2017multi] datasets, as well as in-the-wild captured videos, to demonstrate SLAM3R's generalization ability.
  • ...and 4 more figures
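
The multi-view aggregation described in the Figure 3 caption (keyframe queries attend to each supporting image separately, and the results are merged by max-pooling) can be sketched as follows. This is a simplified illustration under stated assumptions: single-head attention, no learned projections, no residual or normalization layers, and NumPy arrays standing in for token features. The function names `cross_attention` and `multiview_cross_attention` are hypothetical and do not come from the SLAM3R code.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention of queries q (Nq, d) over one image's
    keys k (Nk, d) and values v (Nk, d). Returns an (Nq, d) array."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.T) * scale, axis=-1)  # (Nq, Nk)
    return weights @ v


def multiview_cross_attention(key_q, support_k, support_v) -> np.ndarray:
    """Attend to each supporting image separately, then max-pool across views.

    key_q: (Nq, d) keyframe queries.
    support_k, support_v: sequences of (Nk_i, d) arrays, one per supporting image.
    """
    per_view = [cross_attention(key_q, k, v) for k, v in zip(support_k, support_v)]
    # Element-wise maximum over the view axis: (num_views, Nq, d) -> (Nq, d).
    return np.max(np.stack(per_view, axis=0), axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))                      # 4 keyframe tokens, dim 8
    ks = [rng.normal(size=(6, 8)) for _ in range(3)]  # 3 supporting images
    vs = [rng.normal(size=(6, 8)) for _ in range(3)]
    print(multiview_cross_attention(q, ks, vs).shape)
```

One property of max-pooling worth noting is that the output has the same shape regardless of how many supporting images are used and does not depend on their order, so the same block can handle a variable number of views.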