GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting

Inseo Lee, Youngyoon Choi, Joonseok Lee

TL;DR

GaussianVideo represents video with deformable 2D Gaussian Splatting, coupling a multi-plane spatiotemporal encoder with a lightweight decoder to model dynamic frames efficiently. By conditioning Gaussian deformations on time and using temporal gradients for initialization, the method reduces GPU memory usage by up to 78.4% and achieves 5.5x faster training and 12.5x faster decoding while preserving reconstruction quality. Key contributions include the deformable 2D Gaussian framework, a multi-plane encoder for scalable high-dimensional encoding, and a temporal-gradient initialization that prioritizes dynamic regions. For video compression and real-time rendering, the approach reaches competitive PSNR with higher throughput and lower memory than state-of-the-art NeRV-based methods.

Abstract

Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression, outperforming traditional codecs. As model size grows, however, slow encoding and decoding speed and high memory consumption hinder its application in practice. To address these limitations, we propose a new video representation and compression method based on 2D Gaussian Splatting to efficiently handle video data. Our proposed deformable 2D Gaussian Splatting dynamically adapts the transformation of 2D Gaussians at each frame, significantly reducing memory cost. Equipped with a multi-plane-based spatiotemporal encoder and a lightweight decoder, it predicts changes in color, coordinates, and shape of initialized Gaussians, given the time step. By leveraging temporal gradients, our model effectively captures temporal redundancy at negligible cost, significantly enhancing video representation efficiency. Our method reduces GPU memory usage by up to 78.4%, and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding compared to the state-of-the-art NeRV methods.
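
To make the idea of time-conditioned deformation concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes each Gaussian is parameterized by a mean, a Cholesky factor of its covariance, and an RGB color, and that frames are rendered by simple additive blending. The learned encoder and decoder are replaced by a hypothetical stand-in, `toy_deformation`, which only shifts the Gaussians horizontally over time. All function names here are illustrative.

```python
import numpy as np


def splat(means, chol, colors, height, width):
    """Additively blend N 2D Gaussians into an (H, W, 3) image.

    means:  (N, 2) centers as (x, y)
    chol:   (N, 2, 2) Cholesky factors L, with covariance = L @ L.T
    colors: (N, 3) RGB values
    Additive blending is an assumption made for this sketch.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)
    image = np.zeros((height * width, 3))
    for mu, factor, color in zip(means, chol, colors):
        cov_inv = np.linalg.inv(factor @ factor.T)
        d = pixels - mu
        # Unnormalized Gaussian weight of every pixel for this Gaussian.
        weight = np.exp(-0.5 * np.einsum("pi,ij,pj->p", d, cov_inv, d))
        image += weight[:, None] * color
    return image.reshape(height, width, 3)


def toy_deformation(t, num_gaussians):
    """Hypothetical stand-in for the learned encoder and decoder.

    Returns per-Gaussian offsets for means, Cholesky factors, and colors.
    Here every Gaussian simply moves 3 pixels to the right per unit time.
    """
    d_means = np.tile(np.array([3.0 * t, 0.0]), (num_gaussians, 1))
    d_chol = np.zeros((num_gaussians, 2, 2))
    d_colors = np.zeros((num_gaussians, 3))
    return d_means, d_chol, d_colors


def render_frame(means, chol, colors, t, height, width):
    """Deform the shared (initialized) Gaussians for time t, then render."""
    d_means, d_chol, d_colors = toy_deformation(t, len(means))
    return splat(means + d_means, chol + d_chol, colors + d_colors,
                 height, width)
```

The point of the sketch is that only one set of Gaussians is stored for the whole video, and each frame is obtained by applying time-dependent offsets to that shared set instead of keeping separate Gaussians per frame.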

Paper Structure

This paper contains 17 sections, 14 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (Left) Demonstration of the proposed GaussianVideo method on the Bunny Dataset, learning video representation using Gaussian Splatting. (Right) Experimental comparison with several state-of-the-art NeRV methods, demonstrating superior rendering speed (FPS) and competitive reconstruction quality (PSNR) of our method, with the least training time (represented as the circle size).
  • Figure 2: Overview of our GaussianVideo. The encoder $\mathcal{E}$ takes $N$ 2D Gaussians and a time step $t$ as input and produces a feature $F$ fusing the spatio-temporal information. From this feature, a lightweight decoder $\mathcal{D}$ reconstructs the deformations for each Gaussian and renders the image.
  • Figure 3: Temporal gradient-based Initialization. Brighter areas indicate regions with higher variation. We initialize more Gaussians in these brighter regions.
  • Figure 4: A Qualitative Example on DAVIS. All models in this example have a size of 0.75M.
  • Figure 5: PSNR results over Epochs. All models with a size of 0.35M are trained over 900 epochs.
  • ...and 2 more figures
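
The temporal gradient-based initialization described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's exact procedure: the temporal gradient is approximated by the mean absolute difference between consecutive frames, and Gaussian centers are drawn from the resulting map treated as a probability distribution. The function names and the `uniform_mix` parameter (which keeps a small share of samples in static regions) are hypothetical.

```python
import numpy as np


def temporal_gradient_map(video):
    """Mean absolute frame-to-frame difference per pixel.

    video: array of shape (T, H, W) or (T, H, W, C) with T >= 2.
    Returns an (H, W) map; larger values mean more temporal variation.
    """
    video = np.asarray(video, dtype=np.float64)
    if video.shape[0] < 2:
        raise ValueError("need at least two frames")
    diffs = np.abs(np.diff(video, axis=0))
    if diffs.ndim == 4:
        diffs = diffs.mean(axis=-1)  # average over color channels
    return diffs.mean(axis=0)


def sample_gaussian_centers(video, num_gaussians, uniform_mix=0.1, seed=0):
    """Sample (x, y) centers, favoring pixels with high temporal variation.

    uniform_mix is an assumed hyperparameter: the share of probability
    mass spread uniformly so static regions still receive some Gaussians.
    """
    grad = temporal_gradient_map(video)
    h, w = grad.shape
    uniform = np.full(h * w, 1.0 / (h * w))
    total = grad.sum()
    if total > 0:
        probs = (1.0 - uniform_mix) * grad.ravel() / total
        probs = probs + uniform_mix * uniform
    else:
        probs = uniform  # fully static video: fall back to uniform
    probs = probs / probs.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(h * w, size=num_gaussians, replace=True, p=probs)
    ys, xs = np.unravel_index(idx, (h, w))
    return np.stack([xs, ys], axis=1)
```

On a clip where only the right half of the frame changes over time, most sampled centers land in that half, which matches the caption's statement that brighter (higher-variation) regions receive more Gaussians.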