GS4: Generalizable Sparse Splatting Semantic SLAM

Mingqi Jiang; Chanho Kim; Chen Ziwen; Li Fuxin

TL;DR

GS4 tackles dense, semantically labeled SLAM by replacing per-scene optimization with a generalizable, feed-forward Gaussian prediction model and a learned Gaussian refinement network. The system incrementally builds a 3D map of anisotropic Gaussians, jointly predicting geometry, color, and semantics, and refines the map to maintain fidelity with a minimal Gaussian count. After tracking updates, a joint Gaussian-pose optimization of only a few iterations improves map consistency at little computational cost. Experiments show state-of-the-art semantic SLAM performance on ScanNet and strong zero-shot generalization to NYUv2 and TUM RGB-D, while using significantly fewer Gaussians and running faster than prior GS-based methods.

Abstract

Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete, low-resolution maps that are not tightly integrated with semantic prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping, and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from each input frame, using a single shared backbone for both color and semantic prediction. Then, the Gaussian Refinement Network merges the new Gaussians with the existing set while avoiding redundancy. Finally, when significant pose changes are detected, we perform only 1-5 iterations of joint Gaussian-pose optimization to correct drift, remove floaters, and further improve tracking accuracy. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
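
The merge and trigger steps described in the abstract can be illustrated with a toy sketch. This is not the GS4 implementation: the learned Gaussian Refinement Network is replaced here by a plain voxel-occupancy check, the pose change is measured as translation distance only, and every name and constant (`SparseMap`, `voxel_size`, `joint_opt_budget`, the 0.1 threshold) is a hypothetical choice made for illustration. Only the 1-5 iteration range comes from the abstract.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Gaussian:
    position: tuple  # (x, y, z) center in world coordinates
    color: tuple     # (r, g, b)
    label: int       # semantic class id


@dataclass
class SparseMap:
    # Hypothetical voxel size; the paper's redundancy handling is learned.
    voxel_size: float = 0.05
    cells: dict = field(default_factory=dict)

    def _key(self, position):
        # Integer voxel index containing this position.
        return tuple(math.floor(c / self.voxel_size) for c in position)

    def merge(self, new_gaussians):
        """Keep a new Gaussian only if its voxel is still empty.

        Returns the number of Gaussians actually added.
        """
        added = 0
        for g in new_gaussians:
            key = self._key(g.position)
            if key not in self.cells:
                self.cells[key] = g
                added += 1
        return added


def joint_opt_budget(prev_translation, new_translation,
                     threshold=0.1, iterations=5):
    """Return how many joint Gaussian-pose iterations to run.

    Zero when the camera moved less than `threshold` (assumed value);
    otherwise a small fixed budget within the 1-5 range.
    """
    change = math.dist(prev_translation, new_translation)
    return iterations if change > threshold else 0


# Usage: two nearby predictions collapse into one map entry.
sparse_map = SparseMap(voxel_size=0.1)
frame_gaussians = [
    Gaussian((0.01, 0.01, 0.01), (1.0, 0.0, 0.0), label=3),
    Gaussian((0.02, 0.03, 0.04), (1.0, 0.0, 0.0), label=3),
    Gaussian((0.50, 0.50, 0.50), (0.0, 1.0, 0.0), label=7),
]
num_added = sparse_map.merge(frame_gaussians)
budget = joint_opt_budget((0.0, 0.0, 0.0), (0.3, 0.0, 0.0))
```

The point of the sketch is the control flow: redundant Gaussians are rejected at merge time so the map stays sparse, and the more expensive joint optimization runs only when the pose update is large.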
