EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation
Chao Zhang, Yifeng Zhou, Shuheng Wang, Wenfa Li, Degang Wang, Yi Xu, Shaohui Jiao
TL;DR
The paper tackles the challenge of long-sequence dynamic scene reconstruction with complex motions by avoiding GoP-based segmentation and adopting an explicit evolving 3D Gaussian representation. It introduces EvolvingGS, a two-stage pipeline consisting of a Warping Stage for coarse, flow-guided alignment using sparse control points and a Detail Refinement Stage that spawns and prunes Gaussians to handle topology changes while keeping appearance features temporally coherent; a differential temporal encoding scheme further compresses the evolving model. The approach leverages a two-stream refinement (reference and extension Gaussians) with a contribution-based pruning metric to control model growth, and integrates an adaptive iteration strategy to balance quality and efficiency. Experiments on public and challenging custom datasets show state-of-the-art reconstruction quality for extended GoP lengths and achieve over 50x compression, demonstrating substantial practical impact for streaming dynamic scenes with high fidelity.
Abstract
We have recently seen great progress in 3D scene reconstruction through explicit point-based 3D Gaussian Splatting (3DGS), notable for its high quality and fast rendering speed. However, reconstructing dynamic scenes such as complex human performances with long durations remains challenging. Prior efforts fall short of modeling a long-term sequence with drastic motions, frequent topology changes or interactions with props, and resort to segmenting the whole sequence into groups of frames that are processed independently, which undermines temporal stability and thereby leads to an unpleasant viewing experience and inefficient storage footprint. In view of this, we introduce EvolvingGS, a two-stage strategy that first deforms the Gaussian model to coarsely align with the target frame, and then refines it with minimal point addition/subtraction, particularly in fast-changing areas. Owing to the flexibility of the incrementally evolving representation, our method outperforms existing approaches in terms of both per-frame and temporal quality metrics while maintaining fast rendering through its purely explicit representation. Moreover, by exploiting temporal coherence between successive frames, we propose a simple yet effective compression algorithm that achieves over 50x compression rate. Extensive experiments on both public benchmarks and challenging custom datasets demonstrate that our method significantly advances the state-of-the-art in dynamic scene reconstruction, particularly for extended sequences with complex human performances.
