Lightweight Predictive 3D Gaussian Splats
Junli Cao, Vidit Goel, Chaoyang Wang, Anil Kag, Ju Hu, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, Jian Ren
TL;DR
The paper tackles the storage bottleneck of large-scale 3D Gaussian splat representations by introducing a lightweight predictive framework that stores only a subset of 'parent' splats and predicts the attributes of nearby 'child' splats during rendering. It represents a scene as a forest of depth-1 trees in which each child position satisfies $x_k = x_p + g_{pos}(f_\Delta)[k]$, where $x_p$ is the parent position. Child attributes are inferred from a hash grid $\mathcal{H}$ and a self-attention fusion over features, with shared MLPs predicting scale, rotation, color, and opacity. Training optimizes image fidelity with the loss $\mathcal{L} = (1 - \beta)\mathcal{L}_1 + \beta \mathcal{L}_{\mathrm{D-SSIM}}$ and uses a warm-up schedule to stabilize learning. Experiments on Mip-NeRF 360, Tanks&Temples, and Deep Blending show up to ~19–20x storage reduction while matching or exceeding the PSNR of larger baselines, enabling real-time on-device rendering and broad practical deployment.
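The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `g_pos` is stood in for by a single linear layer with hypothetical weights `W` and `b`, the feature size and child count are arbitrary, and the default `beta=0.2` is an assumed value, since the summary does not state one. The L1 and D-SSIM terms are taken as precomputed scalars.

```python
import numpy as np

def predict_child_positions(parent_pos, f_delta, W, b, num_children):
    """Sketch of x_k = x_p + g_pos(f_delta)[k].

    g_pos is approximated by one linear layer (an assumption); it maps
    the feature f_delta to num_children 3D offsets from the parent.
    """
    offsets = (f_delta @ W + b).reshape(num_children, 3)
    # Broadcasting adds the single parent position to every child offset.
    return parent_pos[None, :] + offsets

def photometric_loss(l1, d_ssim, beta=0.2):
    """L = (1 - beta) * L1 + beta * L_D-SSIM; beta=0.2 is an assumed default."""
    return (1.0 - beta) * l1 + beta * d_ssim

# Hypothetical sizes: 4 children per parent, 8-dimensional feature.
rng = np.random.default_rng(0)
K, D = 4, 8
parent_pos = np.array([1.0, 2.0, 3.0])
f_delta = rng.normal(size=D)
W = 0.01 * rng.normal(size=(D, K * 3))
b = np.zeros(K * 3)

children = predict_child_positions(parent_pos, f_delta, W, b, K)
loss = photometric_loss(l1=0.5, d_ssim=0.1)
```

Only `parent_pos`, `f_delta`, and the shared weights would need to be stored in this sketch; the `K` child positions are recomputed on demand, which is the source of the storage saving described above.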
Abstract
Recent approaches representing 3D objects and scenes using Gaussian splats show increased rendering speed across a variety of platforms and devices. While rendering such representations is indeed extremely efficient, storing and transmitting them is often prohibitively expensive. To represent large-scale scenes, one often needs to store millions of 3D Gaussians, occupying gigabytes of disk space. This poses a very practical limitation, prohibiting widespread adoption. Several solutions have been proposed to strike a balance between disk size and rendering quality, but they noticeably reduce the visual quality. In this work, we propose a new representation that dramatically reduces the hard drive footprint while featuring similar or improved quality when compared to the standard 3D Gaussian splats. When compared to other compact solutions, ours offers higher-quality renderings with significantly reduced storage and runs efficiently on a mobile device in real time. Our key observation is that nearby points in the scene can share similar representations. Hence, only a small ratio of 3D points needs to be stored. We introduce an approach to identify such points, which we call parent points. The discarded points, called child points, can be efficiently predicted along with their attributes by tiny MLPs.
