GIViC: Generative Implicit Video Compression

Ge Gao, Siyue Teng, Tianhao Peng, Fan Zhang, David Bull

TL;DR

GIViC aims to bring INR-based video compression up to state-of-the-art performance by combining a conditional implicit diffusion model with a Hierarchical Gated Linear Attention (HGLA) transformer, which together enable full-GOP spatiotemporal modeling. The method decomposes diffusion across spatiotemporal pyramids with cross-scale conditioning, while HGLA models long-range dependencies at linear complexity. Under a Random Access configuration, it reports BD-rate savings of 15.94%, 22.46% and 8.52% over VTM, DCVC-FM and NVRC, respectively, which makes INR-based video codecs competitive with conventional standards. The cost is higher computational complexity from the diffusion and transformer backbones, a trade-off between rate-distortion performance and real-time applicability.
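
To illustrate why a gated linear attention layer scales linearly with sequence length, the sketch below implements the generic recurrent form: a fixed-size 2D state is decayed by a gate and updated with one outer product per token. This is only an illustration of the general technique, not the HGLA transformer itself; the function name, the shapes and the per-key-channel gate `g` are assumptions, and the paper's update along both scale and sequence axes is not reproduced here.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """Recurrent form of a generic gated linear attention layer (illustrative).

    q, k: (T, d_k) queries and keys; v: (T, d_v) values;
    g:    (T, d_k) gates, assumed to lie in [0, 1].
    Cost is O(T * d_k * d_v): one state update per token, no T x T matrix.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))   # fixed-size 2D hidden state
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the previous state per key channel, then add the outer product k_t v_t^T.
        S = g[t][:, None] * S + np.outer(k[t], v[t])
        # Read out with the current query.
        out[t] = q[t] @ S
    return out
```

With all gates set to 1 the recurrence reduces to plain causal linear attention; gates below 1 let the state forget older tokens.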

Abstract

While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a Generative Implicit Video Compression framework, GIViC, aiming at advancing the performance limits of this type of coding method. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting long-term dependencies. Through the newly designed implicit diffusion process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel Hierarchical Gated Linear Attention-based transformer (HGLA) is also integrated into the framework, which dual-factorizes global dependency modeling along scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the first INR-based video codec that outperforms VTM under the RA coding configuration. The source code will be made available.
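
For readers unfamiliar with the BD-rate figures quoted above, the following is a minimal sketch of the usual Bjøntegaard delta-rate computation: fit a cubic polynomial to log-bitrate as a function of PSNR for each codec, integrate both fits over the shared PSNR range, and convert the average log-rate gap into a percentage. The function name `bd_rate` and the rate/PSNR points in the usage example are made up for illustration; they are not data from the paper, and the authors' exact evaluation script may differ.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec relative to the anchor.

    Negative values mean the test codec needs fewer bits at equal PSNR.
    Each curve should have at least four rate-distortion points.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)

    # Cubic fit of log-rate as a function of quality.
    fit_a = np.polyfit(pa, log_ra, 3)
    fit_t = np.polyfit(pt, log_rt, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)

    # Mean log-rate gap, converted back to a percentage.
    avg_log_diff = (area_t - area_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Illustrative points only: the test codec uses 20% fewer bits at every PSNR.
anchor_rate = [100.0, 200.0, 400.0, 800.0]
anchor_psnr = [32.0, 35.0, 38.0, 41.0]
test_rate = [0.8 * r for r in anchor_rate]
print(round(bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr), 2))
```

Under this sign convention, a reported "BD-rate saving of 15.94%" corresponds to a BD-rate of about -15.94.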

Paper Structure

This paper contains 12 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (Top) Illustration of the implicit diffusion framework based on spatiotemporal downsampling of a GOP $\bm{\mathcal{X}}$ with additive noise, interlinking independent diffusion within constant-sized tokens $\{\bm{x}^k_{i,j}\}$ across $k = 1, \dots, K$ levels of abstraction. (Bottom) The global spatiotemporal dependencies are captured by the 2D hidden states $\mathbf{S}^{k}_{i,j}$ of the HGLA transformer, recurrently updated along both the scale ($k$) and sequence ($i,j$) axes.
  • Figure 2: Illustration of the GIViC network architecture.
  • Figure 3: Illustration of cross-resolution consistency training.
  • Figure 4: (A) Rate-distortion curves on the UVG, MCL-JCV, and JVET-B datasets. (B) Reconstruction quality (PSNR) w.r.t. the number of diffusive sampling steps for the low bitrate range (solid lines) and the high bitrate range (dashed lines). (C) BD-rate (PSNR, solid lines) and decoding complexity (dashed lines) w.r.t. context length.
  • Figure 5: Visual comparison of reconstructions by different video codec baselines, where we report the average sequence bpp and the corresponding frame's PSNR.
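
The coarse-to-fine decomposition mentioned in the Figure 1 caption can be pictured with a small sketch that builds a spatiotemporal pyramid from a GOP tensor. The downsampling operator used here (2x average pooling along time, height and width), the function name and the tensor layout are assumptions chosen for illustration; the paper's actual downsampling, noise schedule and tokenization are not reproduced.

```python
import numpy as np

def spatiotemporal_pyramid(gop, levels):
    """Build a coarse-to-fine pyramid from a GOP of shape (T, H, W).

    Each coarser level halves T, H and W by 2x2x2 average pooling, so all
    three dimensions must be divisible by 2 ** (levels - 1).
    Returns a list ordered from coarsest to finest.
    """
    pyramid = [np.asarray(gop, dtype=float)]
    for _ in range(levels - 1):
        x = pyramid[-1]
        T, H, W = x.shape
        # Group neighbouring samples into 2x2x2 blocks and average each block.
        x = x.reshape(T // 2, 2, H // 2, 2, W // 2, 2).mean(axis=(1, 3, 5))
        pyramid.append(x)
    return pyramid[::-1]

# Toy GOP of 32 frames at 16x16 resolution (sizes chosen arbitrarily).
gop = np.arange(32 * 16 * 16, dtype=float).reshape(32, 16, 16)
for level in spatiotemporal_pyramid(gop, levels=3):
    print(level.shape)
```

If every level is cut into tokens of one constant size, the coarsest level yields the fewest tokens, so a model can start from a compact view of the whole sequence before refining finer tokens.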