MOGS: Monocular Object-guided Gaussian Splatting in Large Scenes

Shengkai Zhang, Yuhe Liu, Jianhua He, Xuedou Xiao, Mozi Chen, Kezhong Liu

TL;DR

MOGS tackles the scalability barrier of 3DGS in large scenes by combining a monocular VI‑SfM frontend with image semantics to produce metrized dense depth through object-level priors. It introduces a multi-scale shape consensus module that propagates sparse SfM cues into robust object primitives, and a cross-object depth refinement module that enforces global coherence by optimizing a three-term objective: geometric consistency with a scale‑ambiguous LFM depth, LFM prior anchoring, and edge-aware smoothing. Experiments on KITTI‑Depth and KITTI‑360 show up to 30.4% less training time and up to 19.8% less memory, with rendering quality competitive with LiDAR‑based methods. The approach enables cost-effective, scalable, high‑fidelity rendering in large scenes by relying on object semantics and sparse geometric cues rather than dense LiDAR data.
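
To make the edge-aware smoothing term more concrete, here is a minimal NumPy sketch of the form commonly used in monocular depth estimation: depth gradients are penalized, but the penalty is down-weighted where the image itself has a strong gradient, so depth discontinuities at object boundaries are tolerated. This summary does not give the exact term MOGS uses, so the function name `edge_aware_smoothness`, the weight `gamma`, and the exponential weighting are assumptions for illustration only.

```python
import numpy as np

def edge_aware_smoothness(depth, gray, gamma=1.0):
    """Mean |depth gradient|, down-weighted where the image gradient is large.

    Illustrative form only; not taken from the MOGS paper.
    """
    depth = np.asarray(depth, dtype=np.float64)
    gray = np.asarray(gray, dtype=np.float64)
    # Forward differences along x (columns) and y (rows).
    dd_x = np.abs(depth[:, 1:] - depth[:, :-1])
    dd_y = np.abs(depth[1:, :] - depth[:-1, :])
    di_x = np.abs(gray[:, 1:] - gray[:, :-1])
    di_y = np.abs(gray[1:, :] - gray[:-1, :])
    # Weights approach 0 across strong image edges and 1 in flat regions.
    w_x = np.exp(-gamma * di_x)
    w_y = np.exp(-gamma * di_y)
    return float((w_x * dd_x).mean() + (w_y * dd_y).mean())
```

With this form, a depth step that coincides with an image edge costs far less than the same step inside a textureless region, which is the behavior the term is meant to encourage.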

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) deliver striking photorealism, and extending it to large scenes opens new opportunities for semantic reasoning and prediction in applications such as autonomous driving. Today's state-of-the-art systems for large scenes primarily originate from LiDAR-based pipelines that utilize long-range depth sensing. However, they require costly high-channel sensors whose dense point clouds strain memory and computation, limiting scalability, fleet deployment, and optimization speed. We present MOGS, a monocular 3DGS framework that replaces active LiDAR depth with object-anchored, metrized dense depth derived from sparse visual-inertial (VI) structure-from-motion (SfM) cues. Our key idea is to exploit image semantics to hypothesize per-object shape priors, anchor them with sparse but metrically reliable SfM points, and propagate the resulting metric constraints across each object to produce dense depth. To address two key challenges, i.e., insufficient SfM coverage within objects and cross-object geometric inconsistency, MOGS introduces (1) a multi-scale shape consensus module that adaptively merges small segments into coarse objects best supported by SfM and fits them with parametric shape models, and (2) a cross-object depth refinement module that optimizes per-pixel depth under a combinatorial objective combining geometric consistency, prior anchoring, and edge-aware smoothness. Experiments on public datasets show that, with a low-cost VI sensor suite, MOGS reduces training time by up to 30.4% and memory consumption by 19.8%, while achieving high-quality rendering competitive with costly LiDAR-based approaches in large scenes.
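
The core idea of anchoring a parametric shape with sparse metric points and then propagating depth across the object can be sketched for the simplest shape, a plane. The code below is a minimal illustration, not the authors' implementation: the helper names `fit_plane` and `plane_depth`, the pinhole intrinsics matrix `K`, and the choice of a single planar model are all assumptions. It fits a plane to a handful of 3D points in the camera frame, then intersects each pixel's viewing ray with that plane to obtain metric depth everywhere inside the object's mask.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane n.X = c through Nx3 points (N >= 3), via SVD."""
    points = np.asarray(points, dtype=np.float64)
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, float(normal @ centroid)

def plane_depth(normal, c, K, mask):
    """Camera-frame depth (z) of the plane at each pixel where mask is True.

    Pixels outside the mask, rays parallel to the plane, and intersections
    behind the camera are returned as NaN.
    """
    h, w = mask.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T  # back-projected rays with z = 1
    denom = rays @ normal
    depth = np.full((h, w), np.nan)
    valid = mask & (np.abs(denom) > 1e-9)
    # A point z * ray lies on the plane when n.(z * ray) = c.
    values = c / denom[valid]
    values[values <= 0] = np.nan
    depth[valid] = values
    return depth
```

For example, four sparse points on a ground plane 1.5 m below the camera are enough to assign metric depth to every road pixel below the horizon, which is the sense in which a few reliable SfM points can stand in for dense range measurements on that object.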

Paper Structure

This paper contains 11 sections, 11 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: MOGS exploits the rich semantics of RGB images and sparse, metrically reliable SfM cues to infer object-level shape priors. By propagating these metric constraints through shape priors, it produces metrized dense depth, approaching the effect of costly high-channel LiDAR, thereby enabling better Gaussian initialization and higher-quality rendering.
  • Figure 2: System overview of MOGS. We align VI-SfM visual features with semantic masks from Segment Anything. The multi-scale shape consensus module then establishes object-level shape models and propagates SfM cues to produce metrized dense depth. Building on this, our cross-object depth refinement leverages LFM dense depth to further optimize pixel-wise estimates, yielding strong 3DGS initialization and, ultimately, higher-fidelity Gaussian splatting.
  • Figure 3: Convergence and efficiency under different GS initializations, showing how the number of Gaussians grows over training iterations for each method.
  • Figure 4: Number of iterations needed under different GS initializations to optimize a rendered view to 20 dB PSNR, on multiple datasets.
  • Figure 5: Rendering on KITTI road scenes. Each row shows one representative frame. Columns (left to right) are ground truth, MonoGS, DepthSplat, GS-LIVM, and MOGS. Our method reconstructs sharper thin structures, preserves depth discontinuities at boundaries, and suppresses long-range floaters.
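
The iteration counts in Figure 4 depend on a PSNR threshold, so it may help to recall how that metric is computed. The sketch below uses the standard definition, PSNR = 10 log10(MAX² / MSE). The function names and the assumption that images are floats in [0, 1] are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def iterations_to_reach(psnr_per_iteration, target_db=20.0):
    """Index of the first iteration whose PSNR meets the target, else None."""
    for i, value in enumerate(psnr_per_iteration):
        if value >= target_db:
            return i
    return None
```

For images in [0, 1], a uniform per-pixel error of 0.1 corresponds to 20 dB, which gives a sense of how coarse a rendering at that threshold still is.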