CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting
Hexu Zhao, Xiwen Min, Xiaoteng Liu, Moonjun Gong, Yiming Li, Ang Li, Saining Xie, Jinyang Li, Aurojit Panda
TL;DR
This work addresses the memory bottleneck of 3D Gaussian Splatting (3DGS) by introducing CLM, a sparsity-guided CPU offloading system that enables large-scale models to be trained and rendered on a single consumer GPU. CLM partitions Gaussian attributes into selection-critical (GPU-resident) and non-critical (CPU-resident) data, and leverages frustum culling, precise caching, microbatch pipelining, and a TSP-based scheduling strategy to hide data-transfer and optimizer overheads. Key contributions include attribute-wise offload, precise Gaussian caching, overlapped CPU Adam, and pipeline order optimization, all implemented with pre-rendering frustum culling and dual CUDA streams. Empirically, CLM scales to as many as 102.2M Gaussians on a single RTX 4090 with PSNR comparable to state-of-the-art methods and offers substantial memory savings and throughput improvements over baselines, demonstrating practical single-GPU scalability for 3DGS.
Abstract
3D Gaussian Splatting (3DGS) is an increasingly popular novel view synthesis approach due to its fast rendering time and high-quality output. However, scaling 3DGS to large (or intricate) scenes is challenging due to its large memory requirement, which exceeds most GPUs' memory capacity. In this paper, we describe CLM, a system that allows 3DGS to render large scenes using a single consumer-grade GPU, e.g., an RTX 4090. It does so by offloading Gaussians to CPU memory and loading them into GPU memory only when necessary. To reduce performance and communication overheads, CLM uses a novel offloading strategy that exploits observations about 3DGS's memory access pattern for pipelining, and thus overlaps GPU-to-CPU communication, GPU computation, and CPU computation. Furthermore, we also exploit observations about the access pattern to reduce communication volume. Our evaluation shows that the resulting implementation can render a large scene that requires 100 million Gaussians on a single RTX 4090 and achieve state-of-the-art reconstruction quality.
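
To make the idea of exploiting the access pattern concrete, the following is a minimal toy sketch, not CLM's implementation. It assumes that each view touches only the Gaussians that survive culling, that Gaussians shared with the previous view can stay resident on the GPU, and that reordering views changes how much must be transferred. The culling test is a 1D stand-in for real frustum culling, and the ordering uses a greedy nearest-neighbour heuristic in place of the TSP-based scheduling mentioned above. All function names (`visible_set`, `transfers_for_order`, `greedy_order`) are hypothetical.

```python
from typing import List, Set, Tuple

Point = Tuple[float, float, float]


def visible_set(positions: List[Point], cam_x: float, half_width: float) -> Set[int]:
    """Toy stand-in for frustum culling (assumption): a Gaussian counts as
    visible if its x-coordinate lies within half_width of the camera."""
    return {i for i, (x, _, _) in enumerate(positions) if abs(x - cam_x) <= half_width}


def transfers_for_order(visible: List[Set[int]], order: List[int]) -> int:
    """Count Gaussians copied from CPU to GPU when views are processed in
    `order`, reusing Gaussians already resident from the previous view."""
    resident: Set[int] = set()
    loaded = 0
    for v in order:
        needed = visible[v]
        loaded += len(needed - resident)  # only the missing ones are transferred
        resident = needed                 # keep exactly what this view uses
    return loaded


def greedy_order(visible: List[Set[int]]) -> List[int]:
    """Nearest-neighbour heuristic (a simple substitute for a TSP solver):
    always visit next the view sharing the most Gaussians with the current one."""
    current = 0
    order = [current]
    remaining = set(range(1, len(visible)))
    while remaining:
        # sorted() makes tie-breaking deterministic (lowest view index wins)
        nxt = max(sorted(remaining), key=lambda v: len(visible[v] & visible[current]))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order


# Ten Gaussians on a line, four cameras alternating between the two ends.
positions = [(float(i), 0.0, 0.0) for i in range(10)]
visible = [visible_set(positions, cam_x, 1.5) for cam_x in (0.0, 8.0, 1.0, 9.0)]

naive = transfers_for_order(visible, [0, 1, 2, 3])
reordered = transfers_for_order(visible, greedy_order(visible))
print(naive, reordered)  # prints "10 6"
```

In this toy scene, processing nearby views back to back cuts the number of transferred Gaussians from 10 to 6. The real system additionally overlaps the remaining transfers with GPU and CPU computation, which this sketch does not model.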
