D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, Yiling Xu
TL;DR
D2GV is a deformable 2D Gaussian splatting framework for video. It trains per Group of Pictures (GoP), so cost scales linearly with video length, and it decodes at over 400 FPS while delivering quality competitive with or superior to state-of-the-art INRs. By learning a canonical 2D Gaussian space and a lightweight deformation field, it represents video content with explicit, interpretable primitives rendered by CUDA-accelerated rasterization. Learnable pruning and quantization compress the representation without substantial quality loss, and the method extends to video interpolation, inpainting, and denoising. Empirical results on Big Buck Bunny, UVG, and DAVIS show faster training, higher FPS, and robust downstream performance, highlighting D2GV’s potential as a practical GS-based alternative for video representation and processing.
Abstract
Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. However, their implicit formulation limits both interpretability and efficacy, undermining their practicality as a comprehensive solution. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key objectives: 1) improved efficiency while delivering superior quality; 2) enhanced scalability and interpretability; and 3) increased friendliness for downstream tasks. Specifically, we first divide the video sequence into fixed-length Groups of Pictures (GoPs) to allow parallel training and linear scalability with video length. For each GoP, D2GV represents video frames by applying differentiable rasterization to 2D Gaussians, which are deformed from a canonical space to each frame's timestamp. Notably, leveraging efficient CUDA-based rasterization, D2GV converges quickly and decodes at speeds exceeding 400 FPS, while delivering quality that matches or surpasses state-of-the-art INRs. Moreover, we incorporate a learnable pruning and quantization strategy to streamline D2GV into a more compact representation. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting, and denoising, underscoring its potential as a promising solution for video representation. Code is available at: https://github.com/Evan-sudo/D2GV.
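To make the GoP split and the canonical-to-timestamp deformation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `split_into_gops`, `normalized_timestamps`, `Gaussian2D`, and `deform` are hypothetical, and a per-Gaussian linear offset stands in for the learned deformation field, whose actual form is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


def split_into_gops(num_frames: int, gop_size: int) -> List[range]:
    """Split frame indices into fixed-length GoPs; the last GoP may be shorter."""
    if num_frames < 0 or gop_size <= 0:
        raise ValueError("num_frames must be >= 0 and gop_size must be > 0")
    return [
        range(start, min(start + gop_size, num_frames))
        for start in range(0, num_frames, gop_size)
    ]


def normalized_timestamps(gop: range) -> List[float]:
    """Map the frames of one GoP to timestamps in [0, 1] (an assumed convention)."""
    n = len(gop)
    if n == 1:
        return [0.0]
    return [i / (n - 1) for i in range(n)]


@dataclass(frozen=True)
class Gaussian2D:
    """A canonical 2D Gaussian, reduced here to its center for brevity."""
    mean: Tuple[float, float]
    # Assumption: a constant per-Gaussian offset direction replaces the
    # learned deformation field purely for illustration.
    velocity: Tuple[float, float]


def deform(g: Gaussian2D, t: float) -> Tuple[float, float]:
    """Move the canonical center to timestamp t using the toy linear offset."""
    return (g.mean[0] + t * g.velocity[0], g.mean[1] + t * g.velocity[1])


if __name__ == "__main__":
    gops = split_into_gops(num_frames=10, gop_size=4)
    g = Gaussian2D(mean=(1.0, 2.0), velocity=(2.0, -2.0))
    for gop in gops:
        for frame, t in zip(gop, normalized_timestamps(gop)):
            print(frame, t, deform(g, t))
```

Because each GoP is handled independently, the number of GoPs (and hence the total training work) grows linearly with the number of frames, and separate GoPs can be trained in parallel.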
