ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation

Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, Jiangmiao Pang

TL;DR

This work targets efficient, accurate monocular 3D reconstruction from image sequences by unifying data priors from 3D foundation models with a structured, LoD-aware Gaussian scene representation. ARTDECO integrates a three-module streaming pipeline (Frontend, Backend, Mapping) that combines MASt3R-based pose estimation and loop closure with $\pi^3$ priors, while maintaining scalability through hierarchical Gaussians and distance-aware densification. Experiments across eight indoor/outdoor benchmarks show SLAM-like runtime, robust localization, and rendering quality close to per-scene optimization, validating its practicality for real-time digitization in AR/VR, robotics, and digital twins. The approach demonstrates a principled path to high-fidelity, scalable real-time 3D reconstruction from monocular input, fostering real-to-sim pipelines in complex real-world environments.

Abstract

On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.

Paper Structure

This paper contains 34 sections, 24 equations, 8 figures, 29 tables.

Figures (8)

  • Figure 1: ARTDECO delivers high-fidelity, interactive 3D reconstruction from monocular images, combining efficiency with robustness across indoor and outdoor scenes.
  • Figure 2: Frontend and backend modules. (a) Frontend: Images are captured from the scene and streamed into the frontend. Each incoming frame is aligned with the latest keyframe using a matching module to compute pixel correspondences. Based on the correspondence ratio and pixel displacement, the frame is classified as a keyframe, a mapper frame, or a common frame. The selected frame, along with its pose and point cloud, is then passed to the backend. (b) Backend: For each new keyframe, a loop-detection module evaluates its similarity with previous keyframes. If a loop is detected, the most relevant candidates are refined and connected in the factor graph; otherwise, the keyframe is linked only to recent frames. Finally, global pose optimization is performed with Gauss–Newton, and other frames are adjusted accordingly. We instantiate the matching module with MASt3R [mast3r_eccv24] and the loop-detection module with $\pi^3$ [wang2025pi3].
  • Figure 3: Mapping process. When a keyframe or mapper frame arrives from the backend, new Gaussians are added to the scene. Multi-resolution inputs are analyzed with the Laplacian of Gaussian (LoG) operator to identify regions that require refinement, and new Gaussians are initialized at the corresponding monocular depth positions in the current view. Common frames are not used to add Gaussians but contribute through gradient-based refinement. Each primitive stores position, spherical harmonics (SH), base scale, opacity, local feature, $d_{\text{max}}$, and voxel index $v_{id}$. For rendering, the $d_{\text{max}}$ attribute determines whether a Gaussian is included at a given viewing distance, enabling consistent level-of-detail control.
  • Figure 4: Qualitative comparisons against popular on-the-fly reconstruction baselines across diverse 3D scene datasets. ARTDECO consistently preserves high-quality rendering details in complex and diverse environments, particularly in the regions highlighted with colored rectangles.
  • Figure 5: More Qualitative Reconstruction Results.
  • ...and 3 more figures
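
The Figure 2 caption says each incoming frame is classified as a keyframe, a mapper frame, or a common frame from its correspondence ratio and pixel displacement, but the captions do not give the decision rule or the thresholds. The sketch below is one plausible reading, not the authors' rule: the names `MatchStats` and `classify_frame`, both threshold values, and the direction of each comparison are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FrameType(Enum):
    KEYFRAME = "keyframe"
    MAPPER = "mapper"
    COMMON = "common"


@dataclass
class MatchStats:
    # Fraction of pixels matched to the latest keyframe, in [0, 1].
    correspondence_ratio: float
    # Summary displacement of matched pixels, in pixels (assumed: e.g. a median).
    pixel_displacement: float


def classify_frame(stats, kf_ratio_thresh=0.5, mapper_disp_thresh=20.0):
    """Hypothetical rule; thresholds are illustrative, not from the paper."""
    # Assumption: little overlap with the latest keyframe means a new keyframe is needed.
    if stats.correspondence_ratio < kf_ratio_thresh:
        return FrameType.KEYFRAME
    # Assumption: good overlap but large image motion means the view is
    # still useful for adding Gaussians, so treat it as a mapper frame.
    if stats.pixel_displacement > mapper_disp_thresh:
        return FrameType.MAPPER
    # Otherwise the frame adds little new content and is a common frame.
    return FrameType.COMMON
```

Under these assumed thresholds, `classify_frame(MatchStats(0.3, 5.0))` yields a keyframe, while a well-matched frame with small displacement falls through to a common frame.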
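
The Figure 3 caption states that each Gaussian stores a $d_{\text{max}}$ attribute that determines whether it is included at a given viewing distance. A minimal sketch of such a level-of-detail filter follows. It assumes that a Gaussian is rendered only when the camera is within $d_{\text{max}}$ of its position, so fine-detail primitives with a small $d_{\text{max}}$ drop out for distant views; the `Gaussian` class here is a reduced, hypothetical stand-in that keeps only the two fields the filter needs.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple  # (x, y, z) center in world coordinates
    d_max: float     # assumed: largest viewing distance at which it is rendered


def select_for_rendering(gaussians, camera_center):
    """Keep Gaussians whose distance to the camera does not exceed d_max.

    The comparison direction is an assumption; the caption only says that
    d_max controls inclusion at a given viewing distance.
    """
    return [
        g for g in gaussians
        if math.dist(g.position, camera_center) <= g.d_max
    ]


scene = [
    Gaussian((0.0, 0.0, 1.0), d_max=2.0),    # fine detail, close to the camera
    Gaussian((0.0, 0.0, 10.0), d_max=5.0),   # fine detail, too far to matter
    Gaussian((0.0, 0.0, 10.0), d_max=50.0),  # coarse primitive, kept at range
]
visible = select_for_rendering(scene, (0.0, 0.0, 0.0))
```

With the camera at the origin, the second Gaussian is skipped because its distance of 10 exceeds its $d_{\text{max}}$ of 5, while the coarse primitive at the same location is kept.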