Splatonic: Architecture Support for 3D Gaussian Splatting SLAM via Sparse Processing

Xiaotong Huang, He Zhu, Tianrui Ma, Yuxiang Xiong, Fangxin Liu, Zhezhi He, Yiming Gan, Zihan Liu, Jingwen Leng, Yu Feng, Minyi Guo

TL;DR

Splatonic targets the real-time constraint of 3D Gaussian splatting SLAM on mobile devices by combining adaptive sparse pixel sampling with a pixel-based rendering pipeline and a co-designed hardware accelerator. The approach cuts the number of rendered pixels by up to 256$\times$ and performs $\alpha$-checking preemptively, which enables Gaussian-parallel rendering and better hardware utilization. Across four 3DGS-SLAM algorithms, the authors report up to 14.6$\times$ end-to-end speedup in software on off-the-shelf GPUs, and up to 274.9$\times$ speedup and 4738.5$\times$ energy savings over mobile GPUs with the dedicated hardware, all with comparable accuracy. The work shows a viable path to deploying high-fidelity 3DGS-SLAM on resource-constrained platforms and highlights the potential of pixel-based sparse processing for neural rendering tasks.

Abstract

3D Gaussian splatting (3DGS) has emerged as a promising direction for SLAM due to its high-fidelity reconstruction and rapid convergence. However, 3DGS-SLAM algorithms remain impractical for mobile platforms due to their high computational cost, especially for their tracking process. This work introduces Splatonic, a sparse and efficient real-time 3DGS-SLAM algorithm-hardware co-design for resource-constrained devices. Inspired by classical SLAMs, we propose an adaptive sparse pixel sampling algorithm that reduces the number of rendered pixels by up to 256$\times$ while retaining accuracy. To unlock this performance potential on mobile GPUs, we design a novel pixel-based rendering pipeline that improves hardware utilization via Gaussian-parallel rendering and preemptive $\alpha$-checking. Together, these optimizations yield up to 121.7$\times$ speedup on the bottleneck stages and 14.6$\times$ end-to-end speedup on off-the-shelf GPUs. To address the new bottlenecks that this rendering pipeline introduces in projection and aggregation, we propose a pipelined architecture that also simplifies the overall design. Evaluated across four 3DGS-SLAM algorithms, Splatonic achieves up to 274.9$\times$ speedup and 4738.5$\times$ energy savings over mobile GPUs and up to 25.2$\times$ speedup and 241.1$\times$ energy savings over state-of-the-art accelerators, all with comparable accuracy.
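To give a feel for the 256$\times$ figure, here is a minimal sketch of sparse pixel sampling. It is not the paper's adaptive algorithm: it assumes a fixed scheme that picks one uniformly random pixel per 16×16 block, and the function name `sample_one_pixel_per_block`, the block size, and the 640×480 resolution are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_one_pixel_per_block(height, width, block=16, seed=0):
    """Pick one random pixel in every block x block tile (illustrative scheme only)."""
    rng = np.random.default_rng(seed)
    # Top-left corners of all full blocks that fit in the image.
    ys = np.arange(0, height - block + 1, block)
    xs = np.arange(0, width - block + 1, block)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    # Independent random offset inside each block.
    off_y = rng.integers(0, block, size=grid_y.shape)
    off_x = rng.integers(0, block, size=grid_x.shape)
    # Return an (N, 2) array of (row, col) pixel coordinates.
    return np.stack([grid_y + off_y, grid_x + off_x], axis=-1).reshape(-1, 2)

pixels = sample_one_pixel_per_block(480, 640)
reduction = (480 * 640) / len(pixels)
print(len(pixels), reduction)
```

Under these assumptions a 640×480 frame keeps 1200 of its 307200 pixels, which is a 256-fold reduction in the pixels that rasterization has to render.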

Paper Structure

This paper contains 68 sections, 3 equations, 27 figures.

Figures (27)

  • Figure 1: Overview of 3DGS-SLAM process. Tracking and mapping share the same optimization pipeline with different optimization targets. Tracking optimizes camera poses $\{C_t\}$ while mapping reconstructs the scene $\{G_i\}$.
  • Figure 2: The timing diagram of the 3DGS-SLAM process. Tracking often runs more frequently than mapping. At a given time step $t$, mapping $M_t$ must execute after tracking $T_t$ because it depends on the tracking result.
  • Figure 3: The overview of 3DGS forward and backward passes. The forward pass consists of three stages: projection, sorting, and rasterization. Both projection and sorting are performed at tile granularity to amortize the computational cost across pixels, while rasterization must be performed at the pixel level to render individual pixels correctly, because different pixels within a tile need to integrate different subsets of Gaussians. The backward pass mainly comprises two stages: reverse rasterization and re-projection. Reverse rasterization computes the partial gradients of all pixel-Gaussian pairs and aggregates them to the corresponding Gaussians. Re-projection then transforms the accumulated gradients from the camera coordinate system to the world coordinate system.
  • Figure 4: The amortized latency of tracking vs. mapping across algorithms [keetha2024splatam, yugay2023gaussian, matsuki2024gaussian, pham2024flashslam]. Tracking dominates the execution.
  • Figure 5: Normalized execution breakdown across algorithms. Rasterization and reverse rasterization dominate the execution.
  • ...and 22 more figures
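
The Figure 3 caption notes that pixels within a tile integrate different subsets of Gaussians. A minimal sketch of why, assuming the standard front-to-back $\alpha$-blending used in 3DGS rasterization: each pixel skips Gaussians whose opacity at that pixel is negligible. The cutoff `ALPHA_MIN`, the early-termination threshold `t_min`, and the function name `composite_pixel` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

ALPHA_MIN = 1.0 / 255.0  # assumed alpha-check cutoff, not taken from the paper

def composite_pixel(colors, alphas, t_min=1e-4):
    """Blend depth-sorted Gaussians front to back for a single pixel."""
    color = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        if a < ALPHA_MIN:
            # Alpha check: this Gaussian does not contribute to this pixel.
            continue
        color += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
        if transmittance < t_min:
            # Pixel is effectively opaque; later Gaussians are hidden.
            break
    return color, transmittance

# Three depth-sorted Gaussians; the middle one fails the alpha check.
colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
alphas = [0.5, 0.001, 0.5]
color, remaining = composite_pixel(colors, alphas)
print(color, remaining)
```

In this example the pixel ends up blending only the first and third Gaussians, so two neighbouring pixels with different per-pixel opacities would do different amounts of work, which is the source of the divergence that the summary attributes to tile-based rasterization.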