GEVO: Memory-Efficient Monocular Visual Odometry Using Gaussians

Dasong Gao, Peter Zhi Xuan Li, Vivienne Sze, Sertac Karaman

TL;DR

GEVO tackles memory and energy limitations in monocular GS-based SLAM by rendering past views from a compact Gaussian map instead of storing numerous images, thereby mitigating memory overhead on mobile devices. It introduces occupancy-preserving initialization to prune occlusions and consistency-aware optimization to prevent overfitting, enabling high-fidelity maps despite not retraining on stored past imagery. Across Replica and TUM-RGBD datasets, GEVO achieves comparable rendering and localization accuracy while reducing overhead to about 58 MB, up to 94x lower than prior methods, making GS-based SLAM feasible on energy-constrained platforms. The approach paves the way for real-time, dense, photo-realistic mapping on devices with tight memory budgets, balancing fidelity and practicality for AR/VR and mobile robotics.

Abstract

Constructing a high-fidelity representation of a 3D scene using a monocular camera can enable a wide range of applications on mobile devices, such as micro-robots, smartphones, and AR/VR headsets. On these devices, memory is often limited in capacity, and memory access often dominates the energy consumed by computation. Although Gaussian Splatting (GS) allows for high-fidelity reconstruction of 3D scenes, current GS-based SLAM is not memory efficient because a large number of past images are stored to retrain Gaussians and reduce catastrophic forgetting. These images often require two orders of magnitude more memory than the map itself and thus dominate the total memory usage. In this work, we present GEVO, a GS-based monocular SLAM framework that achieves fidelity comparable to prior methods by rendering past images from the existing map instead of storing them. Novel Gaussian initialization and optimization techniques are proposed to remove artifacts from the map and delay the degradation of the rendered images over time. Across a variety of environments, GEVO achieves comparable map fidelity while reducing the memory overhead to around 58 MB, which is up to 94x lower than prior works.
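
To make the memory argument in the abstract concrete, the following is a rough back-of-the-envelope sketch comparing a stored keyframe buffer with a Gaussian map. Every number and name here is an assumption for illustration: the keyframe count, the 640x480 8-bit RGB image size, the Gaussian count, and the layout of 14 float32 parameters per Gaussian (position, rotation quaternion, scale, opacity, RGB color) are not taken from the paper, and the result is not a reproduction of its 58 MB or 94x figures.

```python
# Illustrative memory estimate: stored keyframes vs. a Gaussian map.
# All sizes below are hypothetical assumptions, not values from the paper.

def keyframe_buffer_bytes(num_keyframes, width, height,
                          channels=3, bytes_per_channel=1):
    """Bytes needed to keep uncompressed keyframe images in memory."""
    return num_keyframes * width * height * channels * bytes_per_channel


def gaussian_map_bytes(num_gaussians, floats_per_gaussian=14, bytes_per_float=4):
    """Bytes needed for the map, assuming 14 float32 parameters per Gaussian:
    3 (position) + 4 (rotation) + 3 (scale) + 1 (opacity) + 3 (color)."""
    return num_gaussians * floats_per_gaussian * bytes_per_float


# Hypothetical scenario: 500 stored keyframes and a map of 100,000 Gaussians.
images = keyframe_buffer_bytes(num_keyframes=500, width=640, height=480)
gaussians = gaussian_map_bytes(num_gaussians=100_000)

print(f"keyframes: {images / 1e6:.1f} MB")
print(f"map:       {gaussians / 1e6:.1f} MB")
print(f"ratio:     {images / gaussians:.1f}x")
```

Under these assumed sizes, the image buffer grows linearly with the number of stored keyframes, while the map grows only with the number of Gaussians, which is why rendering past views from the map instead of storing them removes the dominant term.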

Paper Structure (16 sections, 6 equations, 5 figures, 2 tables)

This paper contains 16 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: During online GS-based SLAM, the map (consisting of 3D Gaussians) is built by rendering and optimizing at each viewpoint using a sliding window buffer of images. a) The region visible during the current sliding window achieves high fidelity after initial optimization. b) However, without storing and retraining the map on a large number of past images, the fidelity of the same region degrades over time due to forgetting (artifacts in rectangles). c) While alleviating forgetting, our GEVO avoids storing past images to reduce the memory overhead. d) To achieve similar map fidelity, MonoGS [matsuki2023gaussian] stores all past keyframes and incurs a memory overhead at least 50$\times$ higher than the size of the map.
  • Figure 2: Two scenarios that cause catastrophic forgetting in Gaussian Splatting: b) retrospective occlusion (RO) and a) incomplete ray obscuration (IRO). RO causes the new Gaussians to occlude ones in the past view (red rectangles in Figure 1). IRO causes the existing Gaussians to overfit to the current view (green rectangles in Figure 1).
  • Figure 3: Occupancy-Preserving Initialization. Given a set of recently acquired keyframes and poses in a sliding window buffer, the depth image for the most recent keyframe is computed using belief propagation on a photometric cost volume at a quarter of the image's resolution. Then, the depth and RGB images are used to initialize a set of Gaussians ($\mathcal{G}_t$) representing obstacles (red) and free regions (blue) using the SPGF* algorithm [gmmap]. Gaussians representing free regions are fused across multiple keyframes to identify instances of retrospective occlusion (RO) during consistency-aware optimization.
  • Figure 4: Consistency-Aware Optimization. Given newly initialized Gaussians ($\mathcal{G}_t$), we perform a GS-based optimization in two stages: i) the local stage performs GS to optimize a local map $\widetilde{\mathcal{M}}_{t}$ that represents all geometries visible from the sliding window, and ii) the global stage selectively optimizes a small active set $\mathcal{A}_t$ of Gaussians (green) consisting of the local map $\widetilde{\mathcal{M}}_{t}$ and existing Gaussians $\mathcal{E}_t$ with high rendering error. This active set selection tends to exclude Gaussians obscured from camera views used for training and thus reduces IRO. Since Gaussians from the local map are sufficiently accurate, images from randomly selected past views are rendered from the global map to guide the global optimization stage. Finally, Gaussians that cause RO are pruned with the help of the obstacle-free regions created during initialization (blue). Note that, for ease of visualization, Gaussians representing free regions are omitted in the global stage except for the pruning step.
  • Figure 5: GEVO achieves rendering accuracy comparable to other monocular methods on Replica (top) and TUM-RGBD (bottom). In particular, GEVO achieves high fidelity by reducing RO, especially for Gaussians representing distant and/or large objects (in green rectangles). Since rendered images degrade in quality slowly over time, optimizing Gaussians using these images in GEVO leads to minor loss of details in some close-up, feature-rich regions (in red rectangles).
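
The active set selection described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_active_set`, the integer Gaussian IDs, the per-Gaussian scalar error, and the fixed threshold are all hypothetical, and the caption does not specify how the rendering error is attributed to individual Gaussians.

```python
# Hypothetical sketch of active set selection for the global stage.
# Assumption: each existing Gaussian already has a scalar rendering error.

def select_active_set(local_ids, existing_errors, error_threshold):
    """Return IDs of Gaussians to optimize in the global stage.

    local_ids:        IDs of Gaussians in the local map (always included).
    existing_errors:  dict mapping existing Gaussian ID -> rendering error.
    error_threshold:  existing Gaussians above this error join the active set.
    """
    active = set(local_ids)
    for gaussian_id, error in existing_errors.items():
        if error > error_threshold:
            active.add(gaussian_id)
    return active


# Usage with made-up values: Gaussians 1 and 2 form the local map;
# Gaussian 3 has high error, Gaussian 4 has low error and stays frozen.
active = select_active_set(
    local_ids=[1, 2],
    existing_errors={3: 0.5, 4: 0.05},
    error_threshold=0.1,
)
print(sorted(active))
```

Keeping low-error existing Gaussians out of the active set means they receive no updates from the current views, which matches the caption's stated intent of limiting overfitting of obscured Gaussians (IRO).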