V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians

Penghao Wang; Zhirui Zhang; Liao Wang; Kaixin Yao; Siyuan Xie; Jingyi Yu; Minye Wu; Lan Xu

TL;DR

This paper introduces V^3 (Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering by streaming dynamic Gaussians, together with a multi-platform player that decodes and renders 2D Gaussian videos.

Abstract

Experiencing high-fidelity volumetric video as seamlessly as 2D videos is a long-held dream. However, current dynamic 3DGS methods, despite their high rendering quality, face challenges in streaming on mobile devices due to computational and bandwidth constraints. In this paper, we introduce V^3 (Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. Additionally, we propose a two-stage training strategy that reduces storage requirements while training rapidly. The first stage employs hash encoding and a shallow MLP to learn motion, then reduces the number of Gaussians through pruning to meet the streaming requirements. The second stage fine-tunes the other Gaussian attributes using a residual entropy loss and a temporal loss to improve temporal continuity. This strategy, which disentangles motion and appearance, maintains high rendering quality with compact storage requirements. We also design a multi-platform player to decode and render 2D Gaussian videos. Extensive experiments demonstrate the effectiveness of V^3, which outperforms other methods by enabling high-quality rendering and streaming on common devices, something not shown before. As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience, including smooth scrolling and instant sharing. Our project page with source code is available at https://authoritywang.github.io/v3/.
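The idea of viewing per-frame Gaussian attributes as 2D video frames can be illustrated with a minimal sketch. It is not the authors' implementation: the row-major pixel layout, the per-channel min/max 8-bit quantization, and the names `pack_attributes` and `unpack_attributes` are all assumptions made for illustration.

```python
import numpy as np


def pack_attributes(attrs: np.ndarray, width: int):
    """Pack an (N, C) float attribute array into a (C, H, W) uint8 image stack.

    Assumption: Gaussian i is stored at pixel (i // width, i % width),
    and each channel is quantized to 8 bits with its own min/max range.
    """
    n, c = attrs.shape
    height = -(-n // width)  # ceiling division
    lo = attrs.min(axis=0)
    hi = attrs.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    q = np.round((attrs - lo) / scale * 255.0).astype(np.uint8)
    padded = np.zeros((height * width, c), dtype=np.uint8)
    padded[:n] = q  # unused trailing pixels stay zero
    frames = padded.T.reshape(c, height, width)
    return frames, lo, scale


def unpack_attributes(frames: np.ndarray, lo: np.ndarray, scale: np.ndarray, n: int):
    """Recover an (N, C) float array from the image stack (lossy)."""
    c = frames.shape[0]
    q = frames.reshape(c, -1).T[:n].astype(np.float32)
    return q / 255.0 * scale + lo
```

Each of the `C` image planes produced this way could then be handed to a standard video encoder, one image per time step. The round trip is lossy only through the 8-bit quantization, so the reconstruction error per channel is at most half a quantization step.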

Paper Structure (31 sections, 13 equations, 12 figures, 4 tables)

Introduction
Related Work
Novel View Synthesis for Dynamic Scenes.
Efficient Radiance Field.
Cross Device Neural Radiance Field Rendering.
Streamable Volumetric Video
V^3 Representation
V^3 Reconstruction
Grouped V^3 Training
Key Frame Reconstruction
Fast Motion Estimation.
Temporal Regularization
Residual Entropy Loss
Temporal Loss.
Total Loss.
...and 16 more sections

Figures (12)

Figure 1: We model dynamic 3DGS as a 2D video with multiple dimensions, where each frame corresponds to its specific 3DGS attributes. During rendering, we extract Gaussian properties from each pixel to recover the Gaussian splat structure.
Figure 2: Overview of V^3 training. For a frame group, we select the first frame as the keyframe and reconstruct it with a prune-and-fine-tune strategy to control the number of Gaussians. For the other frames in the frame group, we employ the sequential two-stage training strategy on each frame to obtain the per-frame 3DGS model.
Figure 3: Keyframe training. Our keyframe uses the triangle mesh generated by NeuS2 [wang2023neus2] as the initial point cloud and then constructs the Gaussian Splatting model. To make our representation more compact, we further prune the Gaussians according to opacity and fine-tune. By iteratively pruning and fine-tuning, we can efficiently control the storage of our model.
Figure 4: Two-stage training. First, we divide the long sequences into groups for training. In the first stage, we use hash encoding followed by a shallow MLP, with position as input, to estimate the motion of the human subjects. In the second stage, we fine-tune the attributes of the warped Gaussians from stage 1 with a residual entropy loss and a temporal loss. This yields a 2D Gaussian video with high temporal consistency, so a video codec can compress it efficiently.
Figure 5: Analysis of the distribution of residual Gaussian attributes shows that the residuals in appearance, scale, and rotation are approximately Gaussian-distributed.
...and 7 more figures
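
The iterative opacity-based pruning described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration rather than the paper's procedure: the target count, the per-round `keep_ratio`, and the function name `prune_by_opacity` are hypothetical, and the fine-tuning step between rounds is only marked by a comment.

```python
import numpy as np


def prune_by_opacity(opacity: np.ndarray, target_count: int, keep_ratio: float = 0.9):
    """Return sorted indices of the Gaussians that survive iterative pruning.

    Each round drops the lowest-opacity Gaussians until at most
    `target_count` remain. `keep_ratio` (assumed) limits how much is
    removed per round so that fine-tuning could recover quality in between.
    """
    if not 0.0 < keep_ratio < 1.0:
        raise ValueError("keep_ratio must be in (0, 1)")
    keep = np.arange(opacity.shape[0])
    while keep.size > target_count:
        n_keep = max(target_count, int(keep.size * keep_ratio))
        # Sort surviving Gaussians by descending opacity and keep the top ones.
        order = np.argsort(-opacity[keep], kind="stable")
        keep = np.sort(keep[order[:n_keep]])
        # A real pipeline would fine-tune the surviving Gaussians here (omitted).
    return keep
```

Bounding the surviving count in this way gives direct control over storage, since the number of Gaussians determines how many pixels each attribute frame needs.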
