QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos

Sharath Girish, Tianye Li, Amrita Mazumdar, Abhinav Shrivastava, David Luebke, Shalini De Mello

TL;DR

QUEEN addresses online streaming of free-viewpoint video (FVV) by modeling dynamic scenes with 3D Gaussian Splatting and learning Gaussian attribute residuals between consecutive frames. It introduces a learned quantization-sparsity framework: a latent-decoder quantizes the residuals of all attributes except position, and a learned gating module sparsifies the position residuals. The difference of viewspace gradients between consecutive frames is used to separate static from dynamic content, which guides sparsity learning and speeds up training. The approach outperforms state-of-the-art online FVV methods on all reported metrics; on several highly dynamic scenes it reduces the model size to 0.7 MB per frame while training in under 5 sec and rendering at 350 FPS. This combination targets real-time, bandwidth-efficient streaming of dynamic 3D scenes across diverse content.

Abstract

Online free-viewpoint video (FVV) streaming is a challenging and relatively under-explored problem. It requires incremental on-the-fly updates to a volumetric representation, fast training and rendering to satisfy real-time constraints, and a small memory footprint for efficient transmission. If achieved, it can enhance user experience by enabling novel applications, e.g., 3D video conferencing and live volumetric video broadcast, among others. In this work, we propose a novel framework for QUantized and Efficient ENcoding (QUEEN) for streaming FVV using 3D Gaussian Splatting (3D-GS). QUEEN directly learns Gaussian attribute residuals between consecutive frames at each time-step without imposing any structural constraints on them, allowing for high-quality reconstruction and generalizability. To efficiently store the residuals, we further propose a quantization-sparsity framework, which contains a learned latent-decoder for effectively quantizing attribute residuals other than Gaussian positions and a learned gating module to sparsify position residuals. We propose to use the Gaussian viewspace gradient difference vector as a signal to separate the static and dynamic content of the scene. It acts as a guide for effective sparsity learning and speeds up training. On diverse FVV benchmarks, QUEEN outperforms the state-of-the-art online FVV methods on all metrics. Notably, for several highly dynamic scenes, it reduces the model size to just 0.7 MB per frame while training in under 5 sec and rendering at 350 FPS. The project website is at https://research.nvidia.com/labs/amri/projects/queen
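
The per-frame update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Gaussian` container, the scalar `scale` decoder, the integer rounding, and the binary `gates` are simplifying assumptions chosen for illustration, and the learning of the latents, decoder, and gates is omitted entirely. It only shows how a frame could be reconstructed from the previous frame plus quantized attribute residuals and gated (sparse) position residuals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    position: List[float]   # 3D mean of the Gaussian
    attribute: List[float]  # stand-in for any non-position attribute (e.g. color)


def quantize(latent: List[float]) -> List[int]:
    """Round each latent to an integer; these integers are what would be stored."""
    return [round(v) for v in latent]


def decode(q_latent: List[int], scale: float) -> List[float]:
    """Hypothetical linear decoder mapping integer latents to attribute residuals."""
    return [scale * q for q in q_latent]


def update_frame(prev, latents, pos_residuals, gates, scale=0.05):
    """Build frame t from frame t-1 by adding residuals.

    Non-position attributes receive decoded, quantized residuals.
    Position residuals are multiplied by a 0/1 gate, so Gaussians with a
    closed gate keep their previous position and cost nothing to store.
    """
    out = []
    for g, lat, dpos, gate in zip(prev, latents, pos_residuals, gates):
        attr_res = decode(quantize(lat), scale)
        new_attr = [a + r for a, r in zip(g.attribute, attr_res)]
        new_pos = [p + gate * d for p, d in zip(g.position, dpos)]
        out.append(Gaussian(new_pos, new_attr))
    return out


# Usage: the first Gaussian is static (gate closed), the second one moves.
prev = [Gaussian([0.0, 0.0, 0.0], [0.5]), Gaussian([1.0, 1.0, 1.0], [0.2])]
curr = update_frame(
    prev,
    latents=[[2.2], [-0.4]],
    pos_residuals=[[0.1, 0.0, 0.0], [0.3, 0.0, 0.0]],
    gates=[0, 1],
)
```

In this toy setting only the integer latents and the position residuals with an open gate would need to be transmitted for each frame, which is the intuition behind the small per-frame footprint reported above.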

Paper Structure

This paper contains 36 sections, 12 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Overview of QUEEN for online FVV. We incrementally update Gaussian attributes at each time-step (gray block) by simultaneously learning and compressing residuals between consecutive time-steps via a quantization (orange block) and sparsity (yellow block) framework. We additionally render only the dynamic Gaussians for masked regions to achieve faster convergence (green block).
  • Figure 2: Viewspace Gradient Difference. We use the difference of viewspace gradients between consecutive frames to identify dynamic scene content.
  • Figure 3: Qualitative Results. A visualization of various scenes in the N3DV and Immersive datasets. PSNR ($\uparrow$) values are shown. We include additional video results in the supplement.
  • Figure 4: Effect of Updating Appearance Attributes. QUEEN updates all Gaussian attributes, resulting in improved quality versus keeping appearance attributes fixed across a video.
  • Figure 5: Effect of Gating. While a large number of gates ($47\%$) are active at the start of training (a, c), most are pruned during training, and only the gates corresponding to changing scene content ($2\%$) remain active (b, d).
  • ...and 6 more figures
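
The idea in the Figure 2 caption, using the difference of viewspace gradients between consecutive frames to identify dynamic content, can be sketched as a simple thresholding step. This is an illustrative assumption rather than the paper's procedure: the function name `dynamic_mask`, the per-Gaussian 2D gradient tuples, and the fixed `threshold` are all hypothetical.

```python
import math
from typing import List, Tuple

Grad2D = Tuple[float, float]  # viewspace (screen-space) gradient of one Gaussian


def dynamic_mask(grads_prev: List[Grad2D],
                 grads_curr: List[Grad2D],
                 threshold: float) -> List[bool]:
    """Flag a Gaussian as dynamic when its viewspace gradient changes by more
    than `threshold` (Euclidean norm) between two consecutive frames."""
    mask = []
    for (px, py), (cx, cy) in zip(grads_prev, grads_curr):
        diff = math.hypot(cx - px, cy - py)
        mask.append(diff > threshold)
    return mask


# Usage: the first Gaussian barely changes, the second changes noticeably.
mask = dynamic_mask(
    grads_prev=[(0.0, 0.0), (0.1, 0.1)],
    grads_curr=[(0.0, 0.001), (0.5, -0.3)],
    threshold=0.05,
)
```

A mask of this kind could then be used to restrict position updates and rendering to the Gaussians flagged as dynamic, in the spirit of the masked rendering mentioned in the Figure 1 caption.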