GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting

Andrew Bond, Jui-Hsien Wang, Long Mai, Erkut Erdem, Aykut Erdem

TL;DR

GaussianVideo tackles memory and training-time challenges in dynamic video representations by extending 3D Gaussian splatting to dynamic scenes and learning camera motion with Neural ODEs. It introduces cubic B-spline dynamics for Gaussian trajectories, a spatio-temporal hierarchical learning pipeline, and integrated camera parameter optimization within the rendering process. The method achieves state-of-the-art reconstruction quality with strong temporal coherence on DL3DV and DAVIS while maintaining competitive training speed and memory usage. This work enables robust dynamic scene representation and downstream video editing tasks such as frame interpolation and spatial resampling without heavy supervision.

Abstract

Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios.
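The idea of obtaining a smooth camera trajectory by integrating a learned velocity field can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the camera state, the network architecture, or the ODE solver, so the 3-D translation state, the tiny fixed-weight `camera_velocity` function (a stand-in for a trained network), and the RK4 integrator are all assumptions made for this example.

```python
import math


def camera_velocity(state, t, weights):
    """Hypothetical stand-in for a learned network f_theta(state, t).

    One tanh hidden layer followed by a linear output layer.
    """
    w1, b1, w2 = weights
    inputs = list(state) + [t]
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def rk4_step(f, state, t, dt, weights):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    k1 = f(state, t, weights)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)], t + 0.5 * dt, weights)
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)], t + 0.5 * dt, weights)
    k4 = f([s + dt * k for s, k in zip(state, k3)], t + dt, weights)
    return [
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def integrate_camera(initial_state, times, weights, substeps=4):
    """Return the camera state at each time in `times`, starting at times[0]."""
    state = list(initial_state)
    states = [list(state)]
    for t0, t1 in zip(times, times[1:]):
        dt = (t1 - t0) / substeps
        for k in range(substeps):
            state = rk4_step(camera_velocity, state, t0 + k * dt, dt, weights)
        states.append(list(state))
    return states


# Assumed toy weights: 4 inputs (x, y, z, t) -> 2 hidden units -> 3 outputs.
demo_weights = (
    [[0.5, -0.2, 0.1, 1.0], [-0.3, 0.4, 0.2, -0.5]],  # hidden-layer weights
    [0.0, 0.1],                                        # hidden-layer biases
    [[0.6, -0.1], [0.2, 0.3], [-0.4, 0.5]],            # output-layer weights
)
frame_times = [0.0, 0.25, 0.5, 0.75, 1.0]
trajectory = integrate_camera([0.0, 0.0, 1.0], frame_times, demo_weights)
```

Because the hidden activations are bounded by `tanh`, the velocity in this sketch is bounded, so the camera cannot jump arbitrarily far between neighbouring frames. That is one simple way to see why an ODE-based parametrization tends to give smooth trajectories.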

Paper Structure (16 sections, 3 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: A toy example illustrating the semantic and coherent motion of the underlying Gaussians. Rather than relying on strong supervisory signals like optical flow during training, we carefully design a Gaussian parametrization tailored for video modeling to naturally encourage this behavior, and demonstrate its effectiveness in downstream applications.
  • Figure 2: Overview of the GaussianVideo approach for neural video representation. Our method combines 3D Gaussian splatting with continuous camera motion modeling via Neural ODEs to handle dynamic scenes efficiently. The pipeline includes hierarchical learning strategies for both (a) spatial and (b) temporal domains, progressively refining Gaussians to capture fine details and smooth motion.
  • Figure 3: Polynomial basis functions, widely used for temporal modeling, can introduce reconstruction errors and instability, as shown in this example. (a) A 3rd-order polynomial fails to fit the U-shaped trajectory, while the commonly used 16th-order polynomial fits better but is unstable and sensitive to small perturbations (see the bottom part of the "U"). In contrast, our cubic B-spline formulation (green dashed lines) fits the shape thanks to its flexibility. (b) When using a low-order polynomial for motion fitting, the white ball following the "U" disappears entirely in the middle frame and reappears later, highlighting the instability of single-polynomial modeling.
  • Figure 4: Video reconstruction results on the DL3DV dataset, comparing GaussianVideo with alternative video representation models including NeRV, HNeRV, and GaussianImage. Each sequence shows reconstructed frames, highlighting GaussianVideo's ability to capture fine spatial details and structural fidelity, even in high-motion scenes.
  • Figure 5: Effect of Gaussian count on performance metrics.
  • ...and 4 more figures
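
The contrast drawn in Figure 3 rests on a standard property of cubic B-splines: each point on the curve depends on only four neighbouring control points, so a change to one control point affects a bounded time interval instead of the whole trajectory. The sketch below illustrates that property for a 2-D trajectory. It assumes uniform knots and a hypothetical set of eight control points tracing a rough "U"; the paper's actual parametrization (knot placement, number of control points, per-Gaussian attributes) is not given in this section.

```python
def cubic_bspline_basis(u):
    """Four uniform cubic B-spline blending weights for local parameter u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )


def eval_trajectory(control_points, t):
    """Evaluate a uniform cubic B-spline trajectory at normalized time t in [0, 1].

    control_points: list of (x, y) tuples with at least 4 entries.
    """
    n_segments = len(control_points) - 3
    s = min(max(t, 0.0), 1.0) * n_segments
    i = min(int(s), n_segments - 1)  # index of the active segment
    u = s - i                        # local parameter within that segment
    weights = cubic_bspline_basis(u)
    active = control_points[i:i + 4]  # only four control points contribute
    x = sum(w * p[0] for w, p in zip(weights, active))
    y = sum(w * p[1] for w, p in zip(weights, active))
    return x, y


# Hypothetical control points tracing a rough "U" shape.
control = [
    (-1.0, 1.0), (-1.0, 0.3), (-0.9, -0.4), (-0.4, -0.9),
    (0.4, -0.9), (0.9, -0.4), (1.0, 0.3), (1.0, 1.0),
]

# Perturb only the last control point.
perturbed = list(control)
perturbed[7] = (1.5, 1.5)

mid_before = eval_trajectory(control, 0.5)
mid_after = eval_trajectory(perturbed, 0.5)
end_before = eval_trajectory(control, 0.95)
end_after = eval_trajectory(perturbed, 0.95)
```

With eight control points there are five segments, and the last control point contributes only to the final one (t ≥ 0.8). The position at t = 0.5 is therefore identical before and after the perturbation, while the position at t = 0.95 moves. A single global polynomial has no such locality, since every coefficient influences the curve at every time.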