Representing Long Volumetric Video with Temporal Gaussian Hierarchy

Zhen Xu, Yinghao Xu, Zhiyuan Yu, Sida Peng, Jiaming Sun, Hujun Bao, Xiaowei Zhou

TL;DR

This work introduces Temporal Gaussian Hierarchy (TGH), a multi-level 4D representation that compactly models long volumetric videos by distributing Gaussians across temporal segments corresponding to different motion scales, enabling near-constant GPU memory usage regardless of video length. A Compact Appearance Model using sparse spherical harmonics and a hardware-accelerated rasterization pipeline further reduce storage and accelerate rendering without sacrificing quality. The approach demonstrates state-of-the-art rendering quality with dramatically lower training costs and memory usage on minutes-long sequences (e.g., 18,000 frames) and real-time 1080p rendering at 450 FPS on an RTX 4090. By enabling efficient handling of long volumetric videos, the method advances dynamic view synthesis for AR/VR, gaming, and telepresence applications, while maintaining high fidelity across challenging scenes. The work also provides a dataset, SelfCap, to facilitate research in long-duration volumetric video modeling.

Abstract

This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1-2 s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that dynamic scenes generally exhibit various degrees of temporal redundancy, since they consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: https://zju3dv.github.io/longvolcap.
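
The segment lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `TemporalHierarchy`, its methods, the integer Gaussian ids, and the rule that segment length halves at each deeper level are all assumptions made for illustration. It only shows why the set of primitives needed at one moment depends on the number of levels rather than on the video length.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalHierarchy:
    """Hypothetical container: level l splits the video into 2**l equal
    temporal segments, and each segment owns a list of Gaussian ids."""
    num_frames: int
    num_levels: int
    segments: dict = field(default_factory=dict)  # (level, index) -> [ids]

    def segment_index(self, level: int, frame: int) -> int:
        # Assumption: segment length halves at every deeper level.
        length = self.num_frames / (2 ** level)
        return min(int(frame // length), 2 ** level - 1)

    def add(self, level: int, frame: int, gaussian_id: int) -> None:
        # Store a Gaussian in the segment of `level` that covers `frame`.
        key = (level, self.segment_index(level, frame))
        self.segments.setdefault(key, []).append(gaussian_id)

    def active(self, frame: int) -> list:
        # Exactly one segment per level covers a given frame, so only
        # `num_levels` segments have to be resident to render it.
        ids = []
        for level in range(self.num_levels):
            key = (level, self.segment_index(level, frame))
            ids.extend(self.segments.get(key, []))
        return ids


h = TemporalHierarchy(num_frames=1200, num_levels=3)
h.add(0, 0, 100)    # slow content: one segment shared by all 1200 frames
h.add(2, 10, 200)   # fast content: segment covering frames 0-299
h.add(2, 900, 201)  # fast content: segment covering frames 900-1199
print(h.active(50))
print(h.active(1000))
```

In this toy setup, Gaussian 100 is stored once and reused for every frame, while Gaussians 200 and 201 are only loaded for the frames their short segments cover.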

Paper Structure

This paper contains 40 sections, 13 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Training cost vs. number of video frames. By varying the number of video frames, we compare our method with recent state-of-the-art volumetric video techniques on the Neural3DV [li2022neural] dataset in terms of training cost and storage usage, measured using VRAM (GB) and model size (MB), respectively. Each bubble’s area is proportional to its storage usage. Our method maintains a constant training cost and near-constant storage usage regardless of the number of frames, demonstrating its scalability to very long volumetric videos.
  • Figure 2: Overview of our framework. Given a long multi-view video sequence, our method can generate a compact volumetric video with minimal training and memory usage while maintaining real-time rendering with state-of-the-art quality. (a) We propose a hierarchical structure where each level consists of multiple temporal segments. Each segment stores a set of 4D Gaussians [yang2023realtime] to parametrize scenes. As shown at the bottom, the 4D Gaussians in different segments represent different granularities of motion, efficiently and effectively modeling video dynamics. (b) The appearance model leverages gradient thresholding to obtain sparse Spherical Harmonics coefficients, resulting in very compact storage while still preserving view-dependent effects.
  • Figure 3: Qualitative comparisons on Mobile-Stage [xu20244k4d] with 1600 frames. For long videos of 1200 frames, our model can be directly trained on the whole sequence and only requires 10.2GB of VRAM for training and 0.42GB of storage, which is 2x and 4x less, respectively, than the second-smallest implicit method, ENeRF [lin2022efficient]. In contrast, 4K4D [xu20244k4d] and 4DGS [yang2023realtime] could only be trained on small segments of 300 frames without encountering out-of-memory errors. Our model achieves high rendering quality and can be rendered at 440 FPS.
  • Figure 4: Qualitative comparisons on Neural3DV [li2022neural] with 1200 frames. Our method can not only recover high-frequency details of dynamic objects but also maintain the sharp appearance of the background with low training costs and a compact model size. 4K4D [xu20244k4d] and 4DGS [yang2023realtime] could only be trained on small segments of 300 frames due to VRAM limitations.
  • Figure 5: Qualitative comparisons on ENeRF-Outdoor [lin2022efficient] with 1200 frames. Here we show multiple sequences for comparison. Our method achieves high-quality rendering while using only 50% of the VRAM and 40% of the storage required by ENeRF, and it is 1.6x faster than the second-best method, 4K4D. Note that 4K4D [xu20244k4d] and 4DGS [yang2023realtime] encounter out-of-memory errors for sequences longer than 300 frames.
  • ...and 8 more figures
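
The caption of Figure 2(b) mentions gradient thresholding to obtain sparse Spherical Harmonics coefficients. One possible reading of that idea is sketched below in plain Python. The function names `sparsify_sh` and `stored_values`, the per-Gaussian gradient magnitudes, and the threshold value are all hypothetical and are not taken from the paper. The sketch keeps the full coefficient set only for Gaussians whose view-dependent gradient exceeds a threshold and stores just the constant (degree-0) term for the rest.

```python
def sparsify_sh(sh_coeffs, grad_norms, threshold):
    """Assumed rule: a Gaussian keeps all SH coefficients only if its
    accumulated view-dependent gradient magnitude exceeds `threshold`;
    otherwise only the first (degree-0) coefficient is kept."""
    sparse = []
    for coeffs, grad in zip(sh_coeffs, grad_norms):
        if grad > threshold:
            sparse.append(list(coeffs))      # view-dependent: keep everything
        else:
            sparse.append(list(coeffs[:1]))  # near-diffuse: constant term only
    return sparse


def stored_values(sparse):
    # Number of coefficients that would actually be written to disk.
    return sum(len(coeffs) for coeffs in sparse)


# Three toy Gaussians, each with 16 coefficients (degree-3 SH, one channel).
sh = [[0.5] + [0.01] * 15, [0.2] + [0.3] * 15, [0.9] + [0.0] * 15]
grads = [1e-6, 4e-3, 0.0]  # made-up gradient magnitudes
compact = sparsify_sh(sh, grads, threshold=1e-4)
print(stored_values(compact), "of", 3 * 16, "coefficients kept")
```

Only the second toy Gaussian retains its higher-degree terms here, which is the kind of saving the caption attributes to the sparse appearance model.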