GIViC: Generative Implicit Video Compression
Ge Gao, Siyue Teng, Tianhao Peng, Fan Zhang, David Bull
TL;DR
GIViC targets state-of-the-art performance for video compression based on implicit neural representations (INRs) by integrating a conditional implicit diffusion model with a Hierarchical Gated Linear Attention (HGLA) transformer, enabling full-GOP spatiotemporal modeling. The method decomposes diffusion across spatiotemporal pyramids and employs cross-scale conditioning, while HGLA provides linear-complexity long-range dependency modeling. Under a Random Access configuration, GIViC yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively, showing that INR-based video codecs can compete with conventional standards. Despite its strong compression performance, GIViC incurs higher computational complexity because of its diffusion and transformer backbones, which highlights a trade-off between rate-distortion performance and real-time applicability.
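To illustrate why gated linear attention scales linearly with sequence length, the following is a minimal sketch of the generic recurrent form, assuming one decay gate per key dimension. It is not the HGLA design, whose hierarchical factorization along the scale and sequence axes is not detailed here; the function name, argument names and shapes are hypothetical.

```python
import numpy as np

def gated_linear_attention(q, k, v, alpha):
    """Generic gated linear attention in recurrent form (illustrative only).

    q, k  : (T, d_k) queries and keys
    v     : (T, d_v) values
    alpha : (T, d_k) decay gates in (0, 1], one per key dimension (assumed)

    A fixed-size state S of shape (d_k, d_v) is updated once per token,
    so the cost grows linearly with the sequence length T.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the running state, then add the current key/value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        # Read out with the current query.
        out[t] = q[t] @ S
    return out
```

With all gates set to 1, this reduces to ordinary causal linear attention (without softmax), which is a convenient sanity check for the recurrence.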
Abstract
While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts under the same coding configuration. In this context, we propose a Generative Implicit Video Compression framework, GIViC, which aims to advance the performance limits of this type of coding method. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting long-term dependencies. Through a newly designed implicit diffusion process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel Hierarchical Gated Linear Attention-based transformer (HGLA) is also integrated into the framework; it dual-factorizes global dependency modeling along the scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the first INR-based video codec that outperforms VTM under the RA coding configuration. The source code will be made available.
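For readers unfamiliar with the BD-rate figures quoted above, the following is a minimal sketch of the commonly used Bjøntegaard delta-rate calculation, assuming PSNR as the quality metric and a cubic polynomial fit of log-rate against quality. The abstract does not state which implementation or quality metric was used, so the function name and the rate-distortion points below are hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec relative to the anchor.

    Negative values mean the test codec needs fewer bits at equal quality.
    Assumes a cubic fit of log(rate) as a function of PSNR (a common choice).
    """
    rate_anchor = np.asarray(rate_anchor, dtype=float)
    rate_test = np.asarray(rate_test, dtype=float)
    psnr_anchor = np.asarray(psnr_anchor, dtype=float)
    psnr_test = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    fit_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    fit_t = np.polyfit(psnr_test, np.log(rate_test), 3)

    # Integrate both fits over the overlapping quality range only.
    lo = max(psnr_anchor.min(), psnr_test.min())
    hi = min(psnr_anchor.max(), psnr_test.max())
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)

    # Mean log-rate difference, converted back to a percentage.
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical rate-distortion points (kbps, dB), not taken from the paper.
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
anchor_psnr = [32.0, 35.0, 38.0, 41.0]
# The test codec uses 20% fewer bits at every quality point,
# so the result is close to -20.
test_rate = [0.8 * r for r in anchor_rate]
print(bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr))
```

Under this convention, a reported BD-rate saving of 15.94% over VTM corresponds to a BD-rate of about -15.94 with VTM as the anchor.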
