Representing Long Volumetric Video with Temporal Gaussian Hierarchy
Zhen Xu, Yinghao Xu, Zhiyuan Yu, Sida Peng, Jiaming Sun, Hujun Bao, Xiaowei Zhou
TL;DR
This work introduces Temporal Gaussian Hierarchy (TGH), a multi-level 4D representation that compactly models long volumetric videos by distributing Gaussians across temporal segments corresponding to different motion scales, enabling near-constant GPU memory usage regardless of video length. A Compact Appearance Model using sparse spherical harmonics and a hardware-accelerated rasterization pipeline further reduce storage and accelerate rendering without sacrificing quality. The approach demonstrates state-of-the-art rendering quality with dramatically lower training costs and memory usage on minutes-long sequences (e.g., 18,000 frames) and real-time 1080p rendering at 450 FPS on an RTX 4090. By enabling efficient handling of long volumetric videos, the method advances dynamic view synthesis for AR/VR, gaming, and telepresence applications, while maintaining high fidelity across challenging scenes. The work also provides a dataset, SelfCap, to facilitate research in long-duration volumetric video modeling.
Abstract
This paper addresses the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, such as feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short video clips (1 to 2 s) and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that dynamic scenes generally exhibit varying degrees of temporal redundancy, since they consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: https://zju3dv.github.io/longvolcap.
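To make the tree-like lookup concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes, purely for illustration, that each deeper level halves the segment length of the level above; the abstract does not state the exact scheme. The names `Segment`, `build_hierarchy`, and `active_segments` are hypothetical. The point it illustrates is that a query time touches exactly one segment per level, so the number of segments that must be resident depends on the number of levels and not on the video length.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # inclusive start time, in seconds
    end: float    # exclusive end time, in seconds
    # Placeholder for the Gaussian primitives owned by this segment.
    gaussian_ids: list = field(default_factory=list)


def build_hierarchy(duration, num_levels, root_length):
    """Build one list of segments per level.

    Assumption for this sketch: level 0 uses segments of `root_length`
    seconds and each deeper level halves the segment length.
    """
    levels = []
    for level in range(num_levels):
        seg_len = root_length / (2 ** level)
        count = math.ceil(duration / seg_len)
        levels.append(
            [Segment(i * seg_len, (i + 1) * seg_len) for i in range(count)]
        )
    return levels


def active_segments(levels, t):
    """Return the single segment on each level whose time span contains t."""
    active = []
    for segments in levels:
        seg_len = segments[0].end - segments[0].start
        # Clamp so that t equal to the video duration maps to the last segment.
        index = min(int(t // seg_len), len(segments) - 1)
        active.append(segments[index])
    return active


if __name__ == "__main__":
    long_video = build_hierarchy(duration=600.0, num_levels=4, root_length=10.0)
    short_video = build_hierarchy(duration=10.0, num_levels=4, root_length=10.0)
    # Both queries touch the same number of segments despite the 60x length gap.
    print(len(active_segments(long_video, 123.4)))
    print(len(active_segments(short_video, 3.2)))
```

In this sketch, slowly changing content would live in the long segments near the root and fast-changing content in the short segments at deeper levels, so a primitive stored in a long segment is shared by every frame that segment covers.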
