CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

Hexu Zhao, Xiwen Min, Xiaoteng Liu, Moonjun Gong, Yiming Li, Ang Li, Saining Xie, Jinyang Li, Aurojit Panda

TL;DR

This work addresses the memory bottleneck of 3D Gaussian Splatting (3DGS) by introducing CLM, a sparsity-guided CPU offloading system that enables large-scale models to be trained and rendered on a single consumer GPU. CLM partitions Gaussian attributes into selection-critical (GPU-resident) and non-critical (CPU-resident) data, and leverages frustum culling, precise caching, microbatch pipelining, and a TSP-based scheduling strategy to hide data-transfer and optimizer overheads. Key contributions include attribute-wise offload, precise Gaussian caching, overlapped CPU Adam, and pipeline order optimization, all implemented with pre-rendering frustum culling and dual CUDA streams. Empirically, CLM scales to as many as 102.2M Gaussians on a single RTX 4090 with PSNR comparable to state-of-the-art methods and offers substantial memory savings and throughput improvements over baselines, demonstrating practical single-GPU scalability for 3DGS.
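To make the idea of order optimization concrete, the sketch below orders views so that consecutive views share as many Gaussians as possible, which reduces how many must be newly transferred. It is only an illustration under stated assumptions: the cost function (number of Gaussians not already loaded for the previous view), the greedy nearest-neighbor heuristic, and the names `transfer_cost`, `greedy_order`, and `total_loads` are all hypothetical and are not taken from the paper, whose actual TSP formulation and solver are not described in this summary.

```python
from typing import FrozenSet, List, Sequence


def transfer_cost(prev: FrozenSet[int], cur: FrozenSet[int]) -> int:
    """Gaussians needed by `cur` that were not already loaded for `prev`."""
    return len(cur - prev)


def greedy_order(view_sets: Sequence[FrozenSet[int]], start: int = 0) -> List[int]:
    """Nearest-neighbor heuristic for a TSP-like ordering of views.

    Assumption: this is a simple stand-in for a real TSP solver.
    Ties are broken by the lower view index so the result is deterministic.
    """
    remaining = set(range(len(view_sets)))
    remaining.remove(start)
    order = [start]
    while remaining:
        last = view_sets[order[-1]]
        nxt = min(remaining, key=lambda i: (transfer_cost(last, view_sets[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def total_loads(view_sets: Sequence[FrozenSet[int]], order: Sequence[int]) -> int:
    """Total Gaussians loaded when views are processed in `order`,
    assuming only the previous view's Gaussians are cached."""
    loads = len(view_sets[order[0]])
    for prev, cur in zip(order, order[1:]):
        loads += transfer_cost(view_sets[prev], view_sets[cur])
    return loads


# Toy example: each set holds the indices of Gaussians inside one view's frustum.
views = [
    frozenset({0, 1, 2, 3}),
    frozenset({7, 8, 9}),
    frozenset({2, 3, 4}),
    frozenset({8, 9, 10}),
]
order = greedy_order(views)
print(order, total_loads(views, [0, 1, 2, 3]), total_loads(views, order))
```

On this toy input the greedy order groups the two overlapping pairs of views together, so fewer Gaussians are loaded than in the original order. A greedy heuristic gives no optimality guarantee; it is used here only because it is short and easy to follow.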

Abstract

3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs' memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., RTX4090. It does so by offloading Gaussians to CPU memory and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS's memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation, and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX4090 and achieve state-of-the-art reconstruction quality.
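The "load only when necessary" idea can be sketched as follows: positions stay in a small always-resident table, a frustum test selects the Gaussians a view can touch, and only those have their remaining attributes fetched from the CPU-side store. This is a minimal sketch under several assumptions that do not come from the paper: the camera sits at the origin looking down +z, each Gaussian is treated as a point (a real culling test would also account for its extent), plain Python containers stand in for GPU and CPU memory, and the names `in_frustum`, `select_visible`, and `load_for_view` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def in_frustum(p: Point, near: float, far: float, tan_x: float, tan_y: float) -> bool:
    """Point-in-frustum test for a camera at the origin looking down +z."""
    x, y, z = p
    return near <= z <= far and abs(x) <= z * tan_x and abs(y) <= z * tan_y


def select_visible(positions: Sequence[Point], near: float = 0.1, far: float = 100.0,
                   fov_x_deg: float = 90.0, fov_y_deg: float = 90.0) -> List[int]:
    """Indices of Gaussians whose centers fall inside the view frustum."""
    tan_x = math.tan(math.radians(fov_x_deg) / 2)
    tan_y = math.tan(math.radians(fov_y_deg) / 2)
    return [i for i, p in enumerate(positions)
            if in_frustum(p, near, far, tan_x, tan_y)]


def load_for_view(positions: Sequence[Point], cpu_attrs: Dict[int, dict],
                  **camera) -> Dict[int, dict]:
    """Fetch the offloaded attributes of visible Gaussians only.

    `positions` models the selection-critical, always-resident data;
    `cpu_attrs` models the attributes kept in CPU memory.
    """
    return {i: cpu_attrs[i] for i in select_visible(positions, **camera)}


# Toy scene: five Gaussians, of which only two lie inside the frustum.
positions = [(0.0, 0.0, 5.0), (10.0, 0.0, 5.0), (0.0, 0.0, -5.0),
             (1.0, 1.0, 200.0), (2.0, -1.0, 4.0)]
cpu_attrs = {i: {"opacity": 0.5, "color": (i, i, i)} for i in range(len(positions))}
loaded = load_for_view(positions, cpu_attrs)
print(sorted(loaded))
```

The fraction of Gaussians returned by `select_visible` is the quantity that sparsity-guided offloading relies on: the smaller it is per view, the less data has to cross the CPU-GPU boundary.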

Paper Structure

This paper contains 32 sections, 15 figures, 7 tables.

Figures (15)

  • Figure 1: The novel view synthesis problem: given a set of training images (with known pose) from a scene, render the image from a novel view with an unrecorded camera position and orientation.
  • Figure 2: 3D Gaussian Splatting Illustration.
  • Figure 3: Runtime decomposition of one batch in naive offloading, which incurs overheads in communication and CPU Adam computation.
  • Figure 4: Frustum Culling: Gaussians outside the camera frustum are not accessed when rendering the camera's view, which results in a sparse memory access pattern to Gaussian parameters. Further, the Gaussians accessed when rendering a view are in the same region, i.e., 3DGS rendering exhibits spatial locality. Our approach uses these observations to improve performance and reduce GPU memory requirements.
  • Figure 5: Empirical cumulative distribution functions (CDF) for the sparsity in Bicycle, Rubble, Alameda, Ithaca, and BigCity.
  • ...and 10 more figures