
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

Zicheng Zhang, Ke Wu, Xiangting Meng, Keyu Liu, Jieru Zhao, Wenchao Ding

Abstract

Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming $\textit{Train-from-Scratch}$ optimization and the lack of inter-frame scale consistency in single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We train a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross-attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a $\textbf{10x}$ speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction: the hidden states act as compact submap descriptors, facilitating efficient loop closure and global $\mathrm{Sim}(3)$ optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications. Project page: https://victkk.github.io/flash-mono.
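The recurrent feed-forward frontend described above can be sketched as a per-frame loop that fuses each new frame into a hidden state while predicting a pose and per-pixel Gaussian attributes, caching hidden states per submap for later loop closure. This is a minimal illustration only; all class and function names (`RecurrentFrontend`, `init_hidden`, `step`, the submap length) are hypothetical, not the authors' actual API.

```python
def process_stream(frames, frontend, submap_length=16):
    """Hedged sketch of the recurrent frontend loop (names hypothetical).

    For each frame, a single forward pass jointly predicts the camera pose
    and per-pixel 2DGS attributes while updating the hidden state via
    cross-attention. The stream is partitioned into submaps; the hidden
    state is reinitialized per submap and the old one is cached in the
    "Bag of Hidden States" for loop closure.
    """
    hidden = frontend.init_hidden()        # fresh hidden state per submap
    bag = []                               # cached submap descriptors
    trajectory, gaussians = [], []
    for i, frame in enumerate(frames):
        pose, attrs, hidden = frontend.step(frame, hidden)
        trajectory.append(pose)
        gaussians.append(attrs)
        if (i + 1) % submap_length == 0:   # submap boundary reached
            bag.append(hidden)             # cache descriptor, then reset
            hidden = frontend.init_hidden()
    return trajectory, gaussians, bag


class DummyFrontend:
    """Stand-in model so the loop runs end to end; a real frontend would be
    a learned network predicting poses and Gaussian attributes."""

    def init_hidden(self):
        return 0

    def step(self, frame, hidden):
        # Return (pose, per-pixel attributes, updated hidden state).
        return (frame, hidden, hidden + 1)
```

The backend would then voxelize, merge, and refine the per-frame attribute predictions into a global 2DGS map, which is omitted here.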


Paper Structure

This paper contains 29 sections, 10 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Our Results for Reconstruction and Rendering & Tracking & Speed Metrics. Our method reconstructs high-quality Gaussian maps in complex scenes with multiple rooms and varying lighting conditions. The right-side radar chart shows our rendering quality (PSNR, SSIM, LPIPS) and trajectory tracking accuracy (ATE), with reciprocals of LPIPS, ATE, and Depth L1 plotted for clarity. Our method outperforms others in both rendering quality and trajectory accuracy, offering a 10x speedup over contemporary monocular GS-SLAM methods.
  • Figure 2: Pipeline. For each new frame, our recurrent model jointly infers the camera pose and per-pixel 2DGS attributes conditioned on a hidden state, which is updated simultaneously. To avoid catastrophic forgetting, the stream is partitioned into submaps, and the hidden state is reinitialized for each submap. Past hidden states are cached in the Bag of Hidden States. Upon loop detection, i.e., revisiting a location, we perform a single forward pass on the loop frame conditioned on the past hidden state to relocalize the current frame in the past submap. A subsequent pose graph optimization then corrects the full trajectory. In the backend, per-frame 2DGS attribute predictions are voxelized, merged, and refined to build a global 2DGS map.
  • Figure 3: Qualitative Rendering Results.
  • Figure 4: Qualitative Analysis on Rendered Depth.
  • Figure 5: Ablation studies. (a) Refine Iterations vs. PSNR. (b) Submap Length vs. ATE RMSE. (c) Loop Closure Settings. (d) PSNR vs. Model Size.
  • ...and 4 more figures
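The loop-closure mechanism described in the Figure 2 caption, where cached hidden states serve as compact submap descriptors, can be illustrated with a simple retrieval step: compare the current descriptor against the Bag of Hidden States and report the best match above a similarity threshold. This is a hedged sketch under the assumption of cosine-similarity retrieval over vector descriptors; the function name, threshold, and similarity measure are illustrative, not the paper's actual procedure.

```python
import math


def detect_loop(current_desc, bag, threshold=0.9):
    """Return the index of the best-matching cached submap descriptor,
    or None if no cosine similarity exceeds the threshold.

    `current_desc` and each entry of `bag` are plain vectors standing in
    for hidden-state descriptors (hypothetical representation).
    """
    best_idx, best_sim = None, threshold
    for idx, past_desc in enumerate(bag):
        dot = sum(a * b for a, b in zip(current_desc, past_desc))
        norm = (math.sqrt(sum(a * a for a in current_desc))
                * math.sqrt(sum(b * b for b in past_desc)))
        sim = dot / norm if norm else 0.0
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx
```

On a detected match, the system would run one forward pass on the loop frame conditioned on the retrieved hidden state to relocalize it in the past submap, followed by pose graph optimization; those learned steps are not reproducible here.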