GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting

Sixu Li, Ben Keller, Yingyan Celine Lin, Brucek Khailany

TL;DR

This work tackles the challenge of real-time 3D Gaussian Splatting (3DGS) on edge GPUs by reusing and enhancing the GPU triangle rasterizer rather than adding a dedicated accelerator. The authors introduce GauRast, a reconfigurable rasterizer that handles both Gaussian and triangle primitives, supported by a CUDA-Collaborative scheduling scheme that keeps the rendering stages pipelined. Their hardware prototype reports a 23× speedup and a 24× reduction in energy consumption on an edge SoC. This yields a 6× end-to-end speedup (24 FPS) for the original 3DGS algorithm and a 4× speedup (46 FPS) for the latest efficiency-optimized pipeline, at an area overhead of 0.2% of the SoC, while remaining compatible with non-NVIDIA GPUs. Overall, GauRast demonstrates a practical path to real-time 3DGS on resource-constrained platforms, enabling broader deployment of neural rendering techniques across edge devices.

Abstract

3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs. This work proposes an acceleration strategy that leverages the similarities between the 3DGS pipeline and the highly optimized conventional graphics pipeline in modern GPUs. Instead of developing a dedicated accelerator, we enhance existing GPU rasterizer hardware to efficiently support 3DGS operations. Our results demonstrate a 23× increase in processing speed and a 24× reduction in energy consumption, with improvements yielding 6× faster end-to-end runtime for the original 3DGS algorithm and 4× for the latest efficiency-improved pipeline, achieving 24 FPS and 46 FPS respectively. These enhancements incur only a minimal area overhead of 0.2% relative to the entire SoC chip area, underscoring the practicality and efficiency of our approach for enabling 3DGS rendering on resource-constrained platforms.
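
As a rough consistency check on the reported numbers, one can back out the baseline frame rates they imply. This is only a sketch: it assumes (the abstract does not state this) that each reported frame rate equals the baseline frame rate multiplied by the corresponding end-to-end speedup, which need not hold exactly if the figures are averaged across scenes.

```latex
\[
\text{FPS}_{\text{baseline}} \approx \frac{\text{FPS}_{\text{GauRast}}}{\text{speedup}}
\quad\Rightarrow\quad
\frac{24}{6} = 4~\text{FPS (original 3DGS)},
\qquad
\frac{46}{4} = 11.5~\text{FPS (efficiency-improved pipeline)}
\]
```

Under that assumption, both baselines fall well short of typical real-time targets, which is consistent with the abstract's claim that real-time execution is infeasible on existing edge GPUs.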

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Representative examples of 3D intelligent applications, including autonomous driving, robotics, and augmented/virtual reality [intro-selfdriving, intro-robotics, intro-arvr].
  • Figure 2: Visualization of a 3D Gaussian representation: (a) Rendered RGB image depicting the scene with realistic color and detail; and (b) the corresponding Gaussian ball representation of the same scene, showing the underlying 3D structure before rendering. Both images are rendered using the 'Bonsai' scene from the NeRF-360 dataset [barron2022mip].
  • Figure 3: Overview of the 3DGS pipeline [kerbl20233d]: (a) Scene representation: The scene is depicted as 3D Gaussian balls, viewed from a specific viewpoint and projected onto a 2D pixel plane. (b) Preprocessing: These 3D Gaussians are projected onto the 2D plane, resulting in 2D Gaussian representations. (c) Sorting: 2D Gaussians are ordered by depth to ensure the correct rendering sequence and handle occlusion properly. (d) Initial rasterization: Colors for each pixel are calculated based on the Gaussians covering that pixel. (e) Color accumulation: Colors are accumulated to produce the final pixel color output.
  • Figure 4: Throughput achieved by the 3DGS rendering pipeline [kerbl20233d] across all seven scenes from the large-scale, real-world NeRF-360 dataset [barron2022mip], as measured on the NVIDIA Jetson Orin NX [orinnx] with a 10 W power limit.
  • Figure 5: Runtime breakdown of the 3DGS rendering pipeline [kerbl20233d] across all seven scenes from the large-scale, real-world NeRF-360 dataset [barron2022mip], as measured on the NVIDIA Jetson Orin NX [orinnx] with a 10 W power limit.
  • ...and 6 more figures
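
The sorting, rasterization, and color-accumulation stages described in the Figure 3 caption can be illustrated with a minimal single-pixel sketch. This is not the authors' implementation or the GauRast hardware datapath: it skips the 3D-to-2D projection of stage (b) and starts from already projected 2D Gaussians, it sorts per call rather than once for the whole frame, and the names `Gaussian2D` and `shade_pixel` as well as the threshold defaults are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    """A projected 2D Gaussian (hypothetical container for this sketch)."""
    mean: tuple      # (x, y) centre on the pixel plane
    depth: float     # view-space depth, used only for ordering
    conic: tuple     # (a, b, c): inverse 2D covariance [[a, b], [b, c]]
    opacity: float   # peak opacity in [0, 1]
    color: tuple     # (r, g, b), each in [0, 1]


def shade_pixel(px, py, gaussians, min_alpha=1 / 255, min_transmittance=1e-4):
    """Blend Gaussians front to back at pixel (px, py).

    The default thresholds are assumed values, not taken from the paper.
    Returns the accumulated (r, g, b) and the remaining transmittance.
    """
    # Stage (c): order by depth so nearer Gaussians are blended first.
    ordered = sorted(gaussians, key=lambda g: g.depth)

    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for g in ordered:
        # Stage (d): evaluate the Gaussian falloff at this pixel.
        dx, dy = px - g.mean[0], py - g.mean[1]
        a, b, c = g.conic
        power = -0.5 * (a * dx * dx + 2.0 * b * dx * dy + c * dy * dy)
        alpha = min(0.99, g.opacity * math.exp(power))
        if alpha < min_alpha:
            continue  # negligible contribution, skip it

        # Stage (e): accumulate color weighted by alpha and transmittance.
        for k in range(3):
            rgb[k] += transmittance * alpha * g.color[k]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # pixel is effectively opaque; stop early
    return tuple(rgb), transmittance


if __name__ == "__main__":
    near_red = Gaussian2D((0.0, 0.0), 1.0, (1.0, 0.0, 1.0), 0.5, (1.0, 0.0, 0.0))
    far_blue = Gaussian2D((0.0, 0.0), 2.0, (1.0, 0.0, 1.0), 0.5, (0.0, 0.0, 1.0))
    # Input order is deliberately back to front; the sort restores it.
    print(shade_pixel(0.0, 0.0, [far_blue, near_red]))
```

In this example the nearer red Gaussian contributes with weight 0.5 and the farther blue one with weight 0.5 × 0.5 = 0.25, which shows why the depth order matters: swapping the depths would swap those weights.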