GS4: Generalizable Sparse Splatting Semantic SLAM

Mingqi Jiang, Chanho Kim, Chen Ziwen, Li Fuxin

TL;DR

GS4 tackles the challenge of dense, semantically labeled SLAM by replacing per-scene optimization with a generalizable, feed-forward Gaussian prediction model and a learned Gaussian refinement network. The system incrementally builds a 3D map of anisotropic Gaussians, jointly predicting geometry, color, and semantics, and refines the map to maintain fidelity with minimal Gaussian count. A few-iteration joint Gaussian–pose optimization after tracking updates enhances map consistency without heavy computation. Experiments show state-of-the-art semantic SLAM performance on ScanNet and strong zero-shot generalization to NYUv2 and TUM RGB-D, all while using significantly fewer Gaussians and achieving faster runtimes than prior GS-based methods.

Abstract

Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantic prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping, and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from the input frame, using a single backbone for both color and semantic prediction. Then, the Gaussian Refinement Network merges new Gaussians with the existing set while avoiding redundancy. Finally, when significant pose changes are detected, we perform only 1-5 iterations of joint Gaussian-pose optimization to correct drift, remove floaters, and further improve tracking accuracy. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
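The three mapping stages described above (prediction, refinement, and few-iteration optimization) can be read as one per-frame step. The following is a minimal control-flow sketch only, not the authors' implementation: the names `Gaussian`, `SemanticMap`, `mapping_step`, the callables `predict`, `refine`, `optimize`, and the thresholds `correction_threshold` and `min_opacity` are all hypothetical placeholders, and the toy stubs stand in for the learned networks.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Gaussian:
    position: Tuple[float, float, float]
    opacity: float
    label: int  # semantic class id (hypothetical encoding)


@dataclass
class SemanticMap:
    gaussians: List[Gaussian] = field(default_factory=list)


def mapping_step(
    gmap: SemanticMap,
    frame: Any,
    pose: Any,
    pose_correction: float,
    predict: Callable,
    refine: Callable,
    optimize: Callable,
    correction_threshold: float = 0.05,  # assumed value, not from the paper
    min_opacity: float = 0.01,           # assumed value, not from the paper
    num_iters: int = 3,                  # the abstract states 1-5 iterations
) -> Any:
    """Run one mapping step and return the (possibly refined) pose."""
    # 1) Feed-forward prediction of Gaussians for the current frame.
    new_gaussians = predict(frame, pose)
    # 2) Refinement merges new and existing Gaussians; transparent ones are pruned.
    merged = refine(gmap.gaussians, new_gaussians)
    gmap.gaussians = [g for g in merged if g.opacity >= min_opacity]
    # 3) A few joint Gaussian-pose iterations, only after a significant pose change.
    if pose_correction > correction_threshold:
        for _ in range(num_iters):
            gmap.gaussians, pose = optimize(gmap.gaussians, pose)
    return pose


# Toy stand-ins for the learned components, for illustration only.
def toy_predict(frame, pose):
    return [
        Gaussian((float(frame), 0.0, 0.0), opacity=0.9, label=1),
        Gaussian((float(frame), 1.0, 0.0), opacity=0.0, label=2),  # will be pruned
    ]


def toy_refine(existing, new):
    return existing + new


def toy_optimize(gaussians, pose):
    return gaussians, pose + 1  # pretend each iteration nudges the pose
```

In this sketch the optimization branch is skipped for most frames, which reflects the abstract's point that per-scene optimization is replaced by feed-forward prediction plus an occasional, short correction.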

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: A radar chart comparing rendering and 3D semantic metrics. Each metric is normalized independently; values closer to the outer edge indicate better performance.
  • Figure 2: Comparison of PSNR with respect to the number of Gaussians across Gaussian Splatting SLAM algorithms (over an average of 2,680 frames in the 6 test scenes of ScanNet). Our method achieves state-of-the-art performance with far fewer Gaussians. GS Num denotes the number of 3D Gaussians in the scene after mapping is complete.
  • Figure 3: Overview of the SLAM System. At each timestep, the system receives an RGB-D frame as input. The tracking system performs local camera tracking and global localization to determine the current frame's pose and correct previous pose errors. Our 3D mapping process comprises three main components: 1) Gaussian Prediction (Sec. \ref{sec:gs_prediction}): Utilizing the current frame's RGB-D data, the Gaussian Prediction Model estimates the parameters and semantic labels for all Gaussians in the current frame; 2) Gaussian Refinement (Sec. \ref{sec:gs_refinement}): Both newly added Gaussians and those in the existing semantic 3D map are refined using the Gaussian Refinement Network to ensure that the combined set of Gaussians accurately represents the scene. A covisibility check ensures that only non-overlapping Gaussians are integrated into the existing 3D map. Post-refinement, the transparent Gaussians are pruned; 3) Few-Iteration Joint Gaussian–Pose Optimization (Sec. \ref{sec:gs_oneiter_opt}): If significant pose corrections are detected, we perform a few iterations of joint Gaussian–pose optimization to update the Gaussians in the 3D map and further refine the poses; the refined poses are then fed back into the tracking system. This ensures consistency of the 3D map with the revised camera trajectories and further improves pose accuracy. (Best viewed in color.)
  • Figure 4: Renderings on ScanNet. Our method, GS4, renders color and depth with significantly better fidelity than all compared approaches.
  • Figure 5: Semantic Renderings on ScanNet. Qualitative comparison of semantic synthesis between our method and the baseline semantic SLAM method, SGS-SLAM. Black areas in the GT labels denote unannotated regions.
  • ...and 1 more figure
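The Figure 1 caption says each metric is normalized independently so that the outer edge of the radar chart means better performance. The caption does not state which normalization is used; the sketch below assumes simple min-max scaling across methods, with lower-is-better metrics flipped. The function name `normalize_metric`, the `higher_is_better` flag, and the method names in the usage note are hypothetical.

```python
from typing import Dict


def normalize_metric(
    values: Dict[str, float], higher_is_better: bool = True
) -> Dict[str, float]:
    """Min-max scale one metric across methods to [0, 1], where 1 is best."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All methods tie on this metric; place them all on the outer edge.
        return {name: 1.0 for name in values}
    scaled = {name: (v - lo) / (hi - lo) for name, v in values.items()}
    if not higher_is_better:
        # Flip error-style metrics so that smaller raw values land further out.
        scaled = {name: 1.0 - s for name, s in scaled.items()}
    return scaled
```

For example, with made-up scores `{"method_a": 10.0, "method_b": 20.0, "method_c": 15.0}` on a higher-is-better metric, the three methods map to 0.0, 1.0, and 0.5. Because each metric is scaled on its own, axes with different units (such as dB and meters) become comparable on one chart, at the cost of hiding absolute gaps.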