WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

Jianhao Zheng, Zihan Zhu, Valentin Bieri, Marc Pollefeys, Songyou Peng, Iro Armeni

TL;DR

WildGS-SLAM is a monocular SLAM framework that operates robustly in dynamic environments by representing the static scene with a 3D Gaussian map and predicting per-pixel uncertainty with an MLP that is trained online on DINOv2 features. The uncertainty informs both tracking (uncertainty-weighted dense bundle adjustment) and mapping (uncertainty-aware rendering loss), enabling dynamic object removal without depth-sensor input or semantic priors. The approach yields artifact-free novel view synthesis and state-of-the-art performance on dynamic benchmarks, including the newly collected Wild-SLAM MoCap and iPhone datasets. The work contributes a practical, generalizable dynamic-SLAM solution and a comprehensive Wild-SLAM dataset for broader evaluation in unconstrained real-world scenarios.
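As a rough illustration of the uncertainty predictor described above, the following sketch maps a grid of per-patch feature vectors to a grid of strictly positive uncertainty values. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `TinyUncertaintyMLP`, the helper `predict_uncertainty_map`, the two-layer layout, the layer sizes, and the softplus output are all hypothetical choices, and the toy features stand in for real DINOv2 features.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus; keeps the predicted uncertainty positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class TinyUncertaintyMLP:
    """Hypothetical shallow MLP: feature vector -> scalar uncertainty."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real system would train these online.
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, feat):
        # Hidden layer with ReLU activation.
        hidden = [
            max(0.0, sum(w * f for w, f in zip(row, feat)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Scalar output passed through softplus.
        return softplus(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)


def predict_uncertainty_map(mlp, feature_grid):
    # feature_grid[row][col] is one feature vector per image patch.
    return [[mlp(feat) for feat in row] for row in feature_grid]
```

A usage example would build a small grid of toy feature vectors, call `predict_uncertainty_map(mlp, grid)`, and obtain one uncertainty value per patch; upsampling that grid to image resolution is a separate step.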

Abstract

We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.
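To make the role of the uncertainty map concrete, the sketch below shows one common way to down-weight unreliable pixels: each squared residual is divided by the squared uncertainty, and a log term discourages the trivial solution of predicting large uncertainty everywhere. This Gaussian-likelihood-style form is an assumption chosen for illustration; the paper's own loss and bundle-adjustment weights are not reproduced here, and the function name and `reg_weight` parameter are hypothetical.

```python
import math


def uncertainty_weighted_loss(residuals, uncertainties, reg_weight=0.5):
    """Mean of r^2 / (2 * beta^2) + reg_weight * log(beta) over all pixels.

    residuals:     per-pixel errors r (e.g. rendered minus observed intensity)
    uncertainties: per-pixel positive uncertainty beta
    reg_weight:    hypothetical weight of the log regularizer
    """
    total = 0.0
    for r, beta in zip(residuals, uncertainties):
        total += (r * r) / (2.0 * beta * beta) + reg_weight * math.log(beta)
    return total / len(residuals)


# Two static pixels with small error and one pixel on a moving object.
residuals = [0.1, 0.1, 5.0]
uniform = uncertainty_weighted_loss(residuals, [1.0, 1.0, 1.0])
aware = uncertainty_weighted_loss(residuals, [1.0, 1.0, 5.0])
```

In this toy example the loss with high uncertainty on the outlier pixel (`aware`) is much smaller than the uniform one, so the pixel on the moving object contributes far less to the optimization of the static map.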

Paper Structure

This paper contains 38 sections, 8 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: WildGS-SLAM. Given a monocular video sequence captured in the wild with dynamic distractors, our method accurately tracks the camera trajectory and reconstructs a 3D Gaussian map for static elements, effectively removing all dynamic components. This approach enables high-fidelity rendering even in complex, dynamic scenes. The illustration presents the final 3D Gaussian map, the camera tracking trajectory (in red), and view synthesis comparisons with baseline methods.
  • Figure 2: System Overview. WildGS-SLAM takes a sequence of RGB images as input and simultaneously estimates the camera poses while building a 3D Gaussian map $\mathcal{G}$ of the static scene. Our method is more robust to dynamic environments thanks to the uncertainty estimation module, in which a pretrained DINOv2 model [yue2025improving] first extracts image features. An uncertainty MLP $\mathcal{P}$ then uses the extracted features to predict per-pixel uncertainty. During tracking, we use the predicted uncertainty as the weight in the dense bundle adjustment (DBA) layer to mitigate the impact of dynamic distractors. We further use monocular metric depth to facilitate pose estimation. In the mapping module, the predicted uncertainty is incorporated into the rendering loss to update $\mathcal{G}$. Moreover, the uncertainty loss is computed in parallel to train $\mathcal{P}$. Note that $\mathcal{P}$ and $\mathcal{G}$ are optimized independently, as illustrated by the gradient flow in the gray dashed line. Faces are blurred to ensure anonymity.
  • Figure 3: Input View Synthesis Results on our Wild-SLAM MoCap Dataset. Regardless of the distractor type, our method is able to remove distractors and render realistic images. Faces are blurred to ensure anonymity.
  • Figure 4: Novel View Synthesis Results on our Wild-SLAM MoCap Dataset. PSNR metrics ($\uparrow$) are included in images.
  • Figure 5: Input View Synthesis Results on our Wild-SLAM iPhone Dataset. We only show rendering results of monocular methods, as depth images are unavailable in this dataset. Note that our uncertainty map appears blurry, as DINOv2 outputs feature maps at 1/14 of the original resolution, and for mapping we also downsample to 1/3 of the original resolution, in order to maintain SLAM system efficiency. For a high-resolution, sharper uncertainty map, the resolution can be increased at the cost of some efficiency; further details and results are provided in the supplementary materials. Faces are blurred to ensure anonymity.
  • ...and 7 more figures
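
The resolution trade-off mentioned in the Figure 5 caption can be worked through with a little arithmetic. The sketch below assumes integer division for both the 1/14 feature grid and the 1/3 mapping resolution, and uses nearest-neighbor upsampling purely for illustration; the function names, the example image size, and the interpolation choice are hypothetical and not taken from the paper.

```python
def feature_grid_size(height, width, patch=14):
    # DINOv2 features are described as 1/14 of the input resolution.
    return height // patch, width // patch


def mapping_size(height, width, factor=3):
    # Mapping is described as running at 1/3 of the input resolution.
    return height // factor, width // factor


def upsample_nearest(grid, out_h, out_w):
    # Nearest-neighbor upsampling of a coarse 2D grid to out_h x out_w.
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[min(y * in_h // out_h, in_h - 1)][min(x * in_w // out_w, in_w - 1)]
         for x in range(out_w)]
        for y in range(out_h)
    ]


# Example with a hypothetical 420 x 630 input image.
feat_h, feat_w = feature_grid_size(420, 630)   # coarse uncertainty grid
map_h, map_w = mapping_size(420, 630)          # mapping resolution
```

For this example size, each coarse uncertainty cell covers roughly 14/3, or about 4.7, mapping pixels per side, which is consistent with the blurry appearance the caption describes.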