Mode-GS: Monocular Depth Guided Anchored 3D Gaussian Splatting for Robust Ground-View Scene Rendering

Yonghan Lee; Jaehoon Choi; Dongki Jung; Jaeseong Yun; Soohyun Ryu; Dinesh Manocha; Suyong Yeon

TL;DR

Mode-GS tackles robust novel-view rendering for ground-robot datasets with sparse multi-view data and pose imperfections. It fuses monocular depth-derived pixel-aligned anchors with anchored Gaussian splats and a residual-form Gaussian decoder, together with a scale-consistent depth loss to handle monocular depth ambiguity. The method achieves state-of-the-art rendering performance on the R$^{3}$LIVE odometry dataset and competitive results on Tanks and Temples, notably without relying on LiDAR or dense SfM point clouds. Ablation confirms that depth calibration and the residual decoder enhance training speed and robustness. The approach offers a practical, point-cloud-free pipeline for ground-view rendering with free trajectories, expanding applicability in real-world robotic perception.

Abstract

We present Mode-GS, a novel-view rendering algorithm for ground-robot trajectory datasets. Our approach builds on anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms. Prior neural rendering methods suffer from severe splat drift caused by scene complexity and insufficient multi-view observation, and can fail to fix splats on the true geometry in ground-robot datasets. Our method integrates pixel-aligned anchors from monocular depths and generates Gaussian splats around these anchors using residual-form Gaussian decoders. To address the inherent scale ambiguity of monocular depth, we parameterize anchors with per-view depth scales and employ a scale-consistent depth loss for online scale calibration. Our method improves rendering performance, measured by PSNR, SSIM, and LPIPS, in ground scenes with free trajectory patterns, and achieves state-of-the-art rendering performance on the R$^{3}$LIVE odometry dataset and the Tanks and Temples dataset.
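The per-view anchor parameterization described in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera model and a single scalar depth scale per view; the function name `init_anchors`, its arguments, and the `stride` subsampling are hypothetical and not taken from the paper. In training the scale would be a learnable parameter, whereas here it is a plain float.

```python
import numpy as np

def init_anchors(depth, K, cam_to_world, scale=1.0, stride=1):
    """Back-project a monocular depth map into world-space anchor points.

    depth:        (H, W) monocular depth, known only up to an unknown scale
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera pose
    scale:        per-view depth scale (a float here; learnable in practice)
    stride:       pixel subsampling step (illustrative, not from the paper)
    """
    H, W = depth.shape
    # Pixel grid: v indexes rows, u indexes columns.
    v, u = np.mgrid[0:H:stride, 0:W:stride]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)], axis=0)  # (3, N)
    # Camera-frame rays with unit z, one per pixel.
    rays = np.linalg.inv(K) @ pix
    # Scaling the depth slides each anchor along its own viewing ray only.
    d = depth[v.ravel(), u.ravel()]
    pts_cam = rays * (scale * d)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return (R @ pts_cam).T + t  # (N, 3) anchors in world coordinates

# Usage: a 3x5 depth map whose principal point sits on pixel (u=2, v=1).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)
anchors = init_anchors(depth, K, np.eye(4))
anchors_scaled = init_anchors(depth, K, np.eye(4), scale=1.5)
```

Because each anchor moves only along its viewing ray, a single scalar per view is enough to rescale every anchor that view contributes, which is the degree of freedom the depth-scale calibration adjusts.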

Paper Structure (12 sections, 8 equations, 5 figures, 3 tables)

Introduction
Related Works
Preliminary
Methods
Per-View Anchor Initialization
Anchored Gaussian Splat Generation
Training from Rendering Losses
Experiment
Rendering evaluation on R$^{3}$LIVE dataset
Rendering evaluation of Tanks and Temples dataset
Ablation Studies
Conclusions, Limitations, and Future Work

Figures (5)

Figure 1: Our Mode-GS integrates monocular depth estimation with anchored Gaussian splatting, using a scale-consistent depth calibration technique and residual-based Gaussian decoders. By incorporating dense pixel-aligned anchor points from monocular depth, anchored splatting improves robustness in scenarios without dense multi-view images and mitigates the impact of inaccurate poses in complex ground-view scenes. Our method can be developed using multi-sensor odometry poses in a point-cloud-free setting. Overall, it offers a practical and robust rendering pipeline for ground-view robotic datasets, as shown in Section V.
Figure 2: We compare the degenerate training patterns of 3DGS in scenarios without dense multi-view information. The patterns are categorized by type: (a) Sequential Type: SLAM-based Gaussian splatting utilizes sequential information by processing consecutive images with pose refinement, initially generating sharper images. However, its poses tend to drift and eventually diverge; (b) Non-Anchored Type: In the original 3DGS and its variants with adaptive density control (ADC), the splats tend to drift from the true geometry without dense multi-view photometric information; (c) Anchored Type: Anchoring effectively prevents splats from becoming detached from the actual geometry.
Figure 3: Our method consists of three main steps: (a) Per-View Anchor Initialization: Given monocular depth images, depth-scale adjustable anchors are initialized from each view. Each anchor is fixed in the 3D scene except for its depth scale toward the corresponding view. (b) Anchor Decoding with Residual-Form Gaussian Decoder: Each anchor is decoded into $k$ Gaussian splats by our residual-form Gaussian decoders. When initialized, each anchor contains nominal Gaussian splat attributes $(\bar{\mu}_j, \bar{r}_j, \bar{c}_j, \bar{o}_j, \bar{s}_j)$ and an embedded feature $f_j$. The residual decoders generate $k$ sets of residual attributes for child splats, which are combined with the nominal anchor attributes to generate child Gaussian splats. (c) Training with Scale-Consistent Depth Loss for Online Depth-Scale Calibration: We use a scale-consistent depth loss $\mathcal{L}_\text{depth}$ that incorporates scales for each monocular depth supervision.
Figure 4: Qualitative comparison on two scenes from the R$^{3}$LIVE dataset. Non-anchored methods, such as 3DGS [kerbl20233d] and GOF [gof], exhibit significant splat drift in the absence of dense multi-view information in sparsely captured scenes. In contrast, both Scaffold-GS [lu2024scaffoldgs] and our method demonstrate robust performance due to their use of anchored splatting. Our approach delivers sharper and more accurate results, attributed to fast training from the direct initialization of splat attributes and dense, pixel-aligned anchor initialization from monocular depth estimation.
Figure 5: (a) With the Residual-Form Gaussian Decoder (top), only the residual from the nominal color is estimated and trained by the decoder, allowing direct color initialization and fast training. The Direct-Form Gaussian Decoder (bottom) [lu2024scaffoldgs, ververas2024sags] does not allow color initialization due to its on-the-fly decoding scheme. (b) Rendering performance (PSNR) ablation between the Direct-Form Color MLP and the Residual-Form Color MLP.
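The residual-form color decoding contrasted in Figure 5 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the two-layer MLP, the hidden width, and the zero-initialized output layer are hypothetical choices, picked so that child colors start exactly at the nominal anchor color. The names `make_residual_decoder` and `decode_colors` are likewise made up for this example.

```python
import numpy as np

def make_residual_decoder(rng, feat_dim, k, hidden=32):
    """Parameters of a tiny MLP mapping an anchor feature to k RGB residuals.

    The output layer is zero-initialized (an assumption of this sketch),
    so every residual is exactly zero before training starts.
    """
    return {
        "W1": rng.normal(0.0, 0.1, (feat_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": np.zeros((hidden, 3 * k)),
        "b2": np.zeros(3 * k),
    }

def decode_colors(params, feats, nominal_color, k):
    """Child colors = nominal anchor color + decoded residual.

    feats:         (N, feat_dim) per-anchor embedded features
    nominal_color: (N, 3) colors initialized directly from the image pixels
    returns:       (N, k, 3) colors of the k child splats of each anchor
    """
    h = np.maximum(feats @ params["W1"] + params["b1"], 0.0)  # ReLU layer
    residual = (h @ params["W2"] + params["b2"]).reshape(-1, k, 3)
    return nominal_color[:, None, :] + residual

# Usage: 4 anchors with 8-dimensional features, 3 child splats each.
rng = np.random.default_rng(0)
k = 3
feats = rng.normal(size=(4, 8))
nominal = rng.uniform(size=(4, 3))
params = make_residual_decoder(rng, feat_dim=8, k=k)
child_colors = decode_colors(params, feats, nominal, k)
```

With a direct-form decoder the MLP output is the color itself, so the initial colors depend on random weights. In the residual form the nominal color carries the pixel-derived initialization and the decoder only has to learn corrections to it.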
