StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video

Zhihui Ke, Yuyang Liu, Xiaobo Zhou, Tie Qiu

TL;DR

StreamSTGS tackles real-time streaming of free-viewpoint video by decoupling dynamic 3D Gaussian Splatting into canonical Gaussians, temporal features, and a deformation field, enabling the canonical attributes to be stored as 2D images and temporal features as a video for adaptive bitrate rendering. A sliding window aggregates local temporal motions, while a transformer-guided auxiliary training module learns global motions without sacrificing rendering speed, facilitated by a GOP-based training regime. The method introduces dynamic-aware density, Gaussian relocation, and a comprehensive optimization objective, achieving about a $1$ dB PSNR improvement while reducing per-frame storage to roughly $170$ KB and delivering competitive rendering performance. These contributions enable real-time streaming of high-quality FVV with adaptive bitrate control, reducing bandwidth requirements without requiring retuning or retraining for different network conditions.
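
The GOP-based grouping and sliding window mentioned above can be pictured with a small indexing sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names `split_into_gops` and `window_indices`, the GOP size, the window radius, and the choice to clamp the window at group boundaries are all hypothetical.

```python
from typing import List


def split_into_gops(num_frames: int, gop_size: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into consecutive groups (GOPs).

    The last group may be shorter when num_frames is not a multiple of gop_size.
    """
    return [
        range(start, min(start + gop_size, num_frames))
        for start in range(0, num_frames, gop_size)
    ]


def window_indices(t: int, gop: range, radius: int) -> List[int]:
    """Return the 2 * radius + 1 frame indices of a window centred on frame t.

    Assumption: the window never leaves its own GOP, so indices that would
    fall outside are clamped to the nearest frame inside the group.
    """
    first, last = gop.start, gop.stop - 1
    return [min(max(t + offset, first), last) for offset in range(-radius, radius + 1)]


if __name__ == "__main__":
    gops = split_into_gops(num_frames=10, gop_size=4)  # hypothetical sizes
    for gop in gops:
        for t in gop:
            # The temporal features at these indices would be the ones
            # aggregated for frame t before querying the deformation field.
            print(t, window_indices(t, gop, radius=1))
```

Each frame then draws on the temporal features of its neighbours within the same group, which is one plausible reading of how local motion could be aggregated while keeping groups independently decodable.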

Abstract

Streaming free-viewpoint video (FVV) in real time still faces significant challenges, particularly in training, rendering, and transmission efficiency. Harnessing the superior performance of 3D Gaussian Splatting (3DGS), recent 3DGS-based FVV methods have achieved notable breakthroughs in both training and rendering. However, the storage requirements of these methods can reach up to $10$ MB per frame, making real-time FVV streaming impossible. To address this problem, we propose a novel FVV representation, dubbed StreamSTGS, designed for real-time streaming. StreamSTGS represents a dynamic scene using canonical 3D Gaussians, temporal features, and a deformation field. For high compression efficiency, we encode canonical Gaussian attributes as 2D images and temporal features as a video. This design not only enables real-time streaming, but also inherently supports adaptive bitrate control based on network conditions without any extra training. Moreover, we propose a sliding window scheme that aggregates adjacent temporal features to learn local motions, and then introduce a transformer-guided auxiliary training module to learn global motions. On diverse FVV benchmarks, StreamSTGS demonstrates competitive performance on all metrics compared to state-of-the-art methods. Notably, StreamSTGS increases the PSNR by an average of $1$ dB while reducing the average frame size to just $170$ KB. The code is publicly available at https://github.com/kkkzh/StreamSTGS.
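
To put the per-frame sizes quoted above into streaming terms, the short calculation below converts a frame size into the sustained bitrate a client would need. It is a back-of-the-envelope sketch: the 30 FPS playback rate and the decimal unit convention (1 KB = 1000 bytes) are assumptions, not figures taken from the abstract, and `required_mbps` is a hypothetical helper name.

```python
def required_mbps(frame_size_kb: float, fps: float) -> float:
    """Sustained bitrate in megabits per second for frames of the given size.

    Assumes decimal units: 1 KB = 1000 bytes, 1 Mbit = 1_000_000 bits.
    """
    bytes_per_second = frame_size_kb * 1000 * fps
    return bytes_per_second * 8 / 1_000_000


if __name__ == "__main__":
    FPS = 30  # assumed playback rate, not stated in the abstract

    # Average frame size reported for StreamSTGS.
    print(f"170 KB per frame: {required_mbps(170, FPS):.1f} Mbit/s")

    # Upper bound the abstract quotes for earlier 3DGS-based FVV methods.
    print(f"10 MB per frame:  {required_mbps(10_000, FPS):.1f} Mbit/s")
```

Under these assumptions, 170 KB per frame corresponds to about 40.8 Mbit/s, whereas 10 MB per frame would need about 2400 Mbit/s, which illustrates why the abstract describes real-time streaming as impossible at the larger size.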

Paper Structure

This paper contains 23 sections, 13 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Overview of the StreamSTGS framework. First, the long video sequence is split into multiple groups. Within each group, a sliding window is employed to extract multiple temporal features, which serve as inputs to the deformation field for predicting the deformation of canonical 3D Gaussians. For real-time streaming, the canonical Gaussians are represented as images, while the temporal features are encoded as a video. To improve global motion learning, we introduce a Transformer-guided auxiliary training strategy, which can be removed during inference to achieve higher FPS.
  • Figure 2: The 2D grid representation of our StreamSTGS. Canonical Gaussian attributes are compressed as 2D images and temporal features are encoded as a video for real-time streaming.
  • Figure 3: Qualitative comparison of our method with the benchmark methods on the Cook Spinach scene of the N3DV dataset.
  • Figure 4: Qualitative comparison of our method with the benchmark methods on the Trimming scene of the MeetRoom dataset.
  • Figure 5: Ablation experiments on the Cut Roasted Beef scene of the N3DV dataset.
  • ...and 13 more figures