GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting
Sixu Li, Ben Keller, Yingyan Celine Lin, Brucek Khailany
TL;DR
This work targets real-time 3D Gaussian Splatting (3DGS) on edge GPUs by reusing and enhancing the existing GPU triangle rasterizer rather than adding a dedicated accelerator. The authors introduce GauRast, a reconfigurable rasterizer that handles both Gaussian and triangle primitives, supported by a CUDA-collaborative scheduling scheme to keep execution pipelined. On an edge SoC, their hardware prototype reports a 23× speedup and a 24× reduction in energy consumption. This yields a 6× end-to-end speedup (24 FPS) for the original 3DGS algorithm and a 4× end-to-end speedup (46 FPS) for the latest efficiency-optimized pipeline, at an area overhead of 0.2% of the SoC, while remaining compatible with non-NVIDIA GPUs. Overall, GauRast demonstrates a practical path to real-time 3DGS on resource-constrained platforms, enabling broader deployment of neural rendering techniques across edge devices.
Abstract
3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs. This work proposes an acceleration strategy that leverages the similarities between the 3DGS pipeline and the highly optimized conventional graphics pipeline in modern GPUs. Instead of developing a dedicated accelerator, we enhance existing GPU rasterizer hardware to efficiently support 3DGS operations. Our results demonstrate a 23× increase in processing speed and a 24× reduction in energy consumption, with improvements yielding 6× faster end-to-end runtime for the original 3DGS algorithm and 4× for the latest efficiency-improved pipeline, achieving 24 FPS and 46 FPS respectively. These enhancements incur only a minimal area overhead of 0.2% relative to the entire SoC chip area, underscoring the practicality and efficiency of our approach for enabling 3DGS rendering on resource-constrained platforms.
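
The end-to-end figures above can be cross-checked with a small calculation. The sketch below is illustrative only and is not taken from the paper: it assumes each reported speedup is the ratio of baseline to accelerated frame time on the same workload, and the helper name `implied_baseline` is hypothetical. Because the reported numbers are rounded, the derived baselines are approximate.

```python
# Figures reported in the abstract: (end-to-end speedup, resulting FPS).
reported = {
    "original 3DGS": (6.0, 24.0),
    "efficiency-improved pipeline": (4.0, 46.0),
}


def implied_baseline(speedup, fps_after):
    """Return (baseline FPS, baseline ms/frame, accelerated ms/frame).

    Assumption (not stated in the source): speedup = baseline frame time
    divided by accelerated frame time, so baseline FPS = fps_after / speedup.
    """
    fps_before = fps_after / speedup
    return fps_before, 1000.0 / fps_before, 1000.0 / fps_after


for name, (speedup, fps_after) in reported.items():
    fps_before, ms_before, ms_after = implied_baseline(speedup, fps_after)
    print(f"{name}: ~{fps_before:.1f} FPS ({ms_before:.0f} ms/frame) before, "
          f"{fps_after:.0f} FPS ({ms_after:.1f} ms/frame) after")
```

Under that assumption, the implied baselines are roughly 4 FPS for the original 3DGS algorithm and 11.5 FPS for the efficiency-improved pipeline, which is consistent with the abstract's statement that real-time execution is infeasible on existing edge devices.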
