MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping

Zhihao Cao, Hanyu Wu, Li Wa Tang, Zizhou Luo, Zihan Zhu, Wei Zhang, Marc Pollefeys, Martin R. Oswald

TL;DR

MCGS-SLAM addresses the limitations of monocular dense SLAM by leveraging synchronized RGB inputs from a calibrated multi-camera rig and a 3D Gaussian Splatting map. It introduces MCBA to jointly optimize camera poses and dense depths across views, and JDSA to enforce metric-scale consistency, all within a differentiable Gaussian-mapping and rendering framework that includes an offline global refinement. The approach yields high-fidelity, photorealistic reconstructions and accurate trajectories, benefiting from wide-field observations that reveal side-view structures otherwise occluded in single-camera setups. Evaluations on Waymo, Oxford Spires, and AirSim demonstrate robust real-time performance, superior geometry and appearance fidelity, and improved coverage essential for safe autonomous operation.

Abstract

Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.
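One practical consequence of a calibrated, synchronized rig is that every camera pose at a given timestep can be derived from a single rig pose and fixed extrinsics. The sketch below illustrates only that generic composition; it is not the MCBA formulation from the paper, and the names `make_pose`, `camera_poses_in_world`, `T_world_rig`, and `rig_extrinsics` are hypothetical, as are the example extrinsic values.

```python
import numpy as np

def make_pose(yaw_rad, translation):
    """Build a 4x4 rigid transform from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def camera_poses_in_world(T_world_rig, rig_extrinsics):
    """Compose one rig pose with fixed per-camera extrinsics (camera -> rig).

    Assumption: extrinsics are known from calibration and constant over time,
    so only T_world_rig changes per timestep.
    """
    return [T_world_rig @ T_rig_cam for T_rig_cam in rig_extrinsics]

# Hypothetical three-camera rig: front, left-facing, right-facing.
rig_extrinsics = [
    make_pose(0.0, [0.0, 0.0, 0.0]),
    make_pose(np.pi / 3, [0.0, 0.2, 0.0]),
    make_pose(-np.pi / 3, [0.0, -0.2, 0.0]),
]
T_world_rig = make_pose(0.1, [5.0, 1.0, 0.0])
poses = camera_poses_in_world(T_world_rig, rig_extrinsics)
```

Under this assumption the number of pose unknowns per timestep stays at six regardless of how many cameras the rig carries, which is one reason joint multi-camera optimization can be better constrained than independent per-camera tracking.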

Paper Structure

This paper contains 25 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The sensor suite of the Waymo Open Dataset [sun2020scalability] integrates multiple wide-angle RGB cameras mounted centrally on the vehicle's roof. Their fan-shaped fields of view together provide $240^\circ$ coverage. This configuration enables high-density observations for multi-camera SLAM and autonomous driving algorithms.
  • Figure 2: Our method performs real-time SLAM by fusing synchronized inputs from a multi-camera rig into a unified 3D Gaussian map. It first selects keyframes and estimates depth and normal maps for each camera, then jointly optimizes poses and depths via multi-camera bundle adjustment and scale-consistent depth alignment. Refined keyframes are fused into a dense Gaussian map using differentiable rasterization, interleaved with densification and pruning. An optional offline stage further refines camera trajectories and map quality. The system supports RGB inputs, enabling accurate tracking and photorealistic reconstruction.
  • Figure 3: Qualitative results on the Waymo dataset [sun2020scalability] (real-world dataset). MCGS-SLAM reconstructs urban scenes with higher fidelity and completeness, preserving structural details and textures that monocular methods often miss.
  • Figure 4: MCGS-SLAM produces faithful and complete reconstructions on AirSim [airsim2017fsr] (synthetic dataset).
  • Figure 5: Tracking performance on the Oxford Spires Dataset [tao2025spires], evaluated across 4 representative sequences. Ground-truth trajectories are compared against Splat-SLAM [sandstrom2025splat], HI-SLAM2 [zhang2024hi], and our MCGS-SLAM. MCGS-SLAM remains closely aligned with ground truth across all sequences, usually achieving the lowest ATE RMSE values and demonstrating the robustness and accuracy of our multi-camera framework in large-scale outdoor environments.
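
The ATE RMSE metric cited in Figure 5 is conventionally computed by aligning the estimated trajectory to ground truth and taking the root-mean-square of the remaining position errors. The following is a minimal sketch of that convention using Umeyama alignment in NumPy. Whether the paper's evaluation aligns with or without scale is not stated here, so the `with_scale` switch is an assumption, and the function names `umeyama_alignment` and `ate_rmse` are illustrative rather than taken from the authors' code.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Least-squares similarity transform (scale, R, t) mapping src onto dst.

    src, dst: (N, 3) arrays of time-associated positions.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard: force a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    scale = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

def ate_rmse(est, gt, with_scale=True):
    """Absolute trajectory error (RMSE) after aligning est to gt."""
    scale, R, t = umeyama_alignment(est, gt, with_scale)
    aligned = scale * (R @ est.T).T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

Scale alignment is typically needed for monocular baselines, whose trajectories are recovered only up to an unknown scale. For a system that claims metric-scale consistency, also reporting the error with `with_scale=False` would show how much of the accuracy depends on that post hoc rescaling.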