Generative Neural Video Compression via Video Diffusion Prior

Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma

TL;DR

The paper tackles ultra-low bitrate perceptual video compression and the temporal flicker that accompanies it. It proposes GNVC-VD, a framework that uses a pre-trained video diffusion transformer to perform joint spatio-temporal latent compression and sequence-level refinement. A two-stage training scheme first aligns the diffusion prior with compressed latents and then fine-tunes in the pixel domain, with a conditioning adaptor to maintain temporal coherence. Experiments show state-of-the-art perceptual quality below 0.03 bpp and significantly reduced flicker compared with traditional, learned, and image-prior generative codecs, highlighting the potential of video-native priors for next-generation perceptual video compression.

Abstract

We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
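
The idea of starting refinement from decoded latents instead of pure Gaussian noise can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the interpolation convention `x_t = (1 - t) * clean + t * noise`, the intermediate start time `t_start`, the Euler solver, and the function names `refine_latents` and `oracle_velocity`. In the paper's setting, a video diffusion transformer with a learned correction term would presumably play the role that the hand-written oracle plays here.

```python
import numpy as np

def refine_latents(decoded, velocity_fn, t_start=0.5, num_steps=4, seed=0):
    """Euler-integrate a flow from a partially noised decoded latent down to t = 0.

    Assumed convention: x_t = (1 - t) * clean + t * noise, so moving t toward 0
    moves the state toward a clean latent. A pure generator would start at t = 1
    from noise alone; here the start point already contains the decoded latent.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(decoded.shape)
    x = (1.0 - t_start) * decoded + t_start * noise   # partially noised start
    ts = np.linspace(t_start, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur, decoded)            # decoded latent acts as conditioning
        x = x + (t_next - t_cur) * v                  # one Euler step
    return x

# Toy data: (frames, channels, height, width) latents.
rng = np.random.default_rng(1)
clean = rng.standard_normal((4, 2, 3, 3))
decoded = clean + 0.3 * rng.standard_normal(clean.shape)  # stand-in for compression damage

def oracle_velocity(x, t, cond):
    # Hypothetical ideal network: the velocity that carries x to `clean` at t = 0.
    # A real model would have to predict this from x, t and the conditioning.
    return (x - clean) / t

refined = refine_latents(decoded, oracle_velocity)
err_before = float(np.mean((decoded - clean) ** 2))
err_after = float(np.mean((refined - clean) ** 2))
```

With the oracle, the refined latent lands on the clean one, so `err_after` is far smaller than `err_before`. The point of the sketch is only the control flow: the whole latent sequence is refined jointly, and the trajectory begins at the decoded latents rather than at noise.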

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Qualitative comparison on ultra-low bitrate video compression. Traditional and learned codecs produce blurry frames. Generative approaches such as GLC-Video \cite{qi2025generative} yield sharper textures but introduce structural hallucinations and unstable details, causing pronounced temporal flickering (see Fig. \ref{fig:illustration}). Leveraging a video-native diffusion prior, GNVC-VD produces coherent fine textures with strong temporal stability. Zoom in for best view.
  • Figure 2: (a) Spatial and $t$--$x$ comparisons. Traditional and learned codecs lose fine textures, while GLC-Video \cite{qi2025generative} exhibits sharp but unstable structures that cause temporal flickering. GNVC-VD preserves clean textures and stable motion. (b) Frame-wise warp error $E_{\text{warp}}$ further confirms GNVC-VD's temporal stability, in contrast to the large fluctuations of GLC-Video.
  • Figure 3: Overview of the proposed GNVC-VD framework. (a) Overall pipeline composed of two key modules: (b) a Contextual Latent Codec for spatio-temporal latent compression (Section \ref{sec:compression}), and (c) a VideoDiT-based refinement module that performs flow-matching latent refinement (Section \ref{sec:refinement}).
  • Figure 4: Rate–distortion curves on the HEVC-B \cite{flynn16common}, UVG \cite{UVG}, and MCL-JCV \cite{MCL-JCV} datasets in the ultra-low bitrate regime ($<0.03$ bpp), with perceptual quality reported in terms of LPIPS and DISTS. GNVC-VD consistently achieves the best perceptual quality, clearly outperforming traditional codecs (HEVC, VVC), learned codecs (DCVC-FM, DCVC-RT), and generative baselines (GLC-Video).
  • Figure 5: Qualitative comparison across different codecs at ultra-low bitrates. Compared with traditional, learned, and prior generative codecs, GNVC-VD preserves finer structures. More visual examples are available in Appendix Section \ref{sec:additional_visual}.
  • ...and 6 more figures
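
Two quantities recur in the captions above: the bitrate in bits per pixel (bpp) and the frame-wise warp error $E_{\text{warp}}$. The sketch below shows how each could be computed. The bpp formula is the standard one (total bits divided by the number of pixels across all frames). The warp error definition is an assumption, since this page does not give the paper's exact formula: here it is the mean squared difference between each frame and the previous frame warped by a given backward optical flow, using nearest-neighbour sampling. The function names are hypothetical.

```python
import numpy as np

def bits_per_pixel(total_bytes, num_frames, height, width):
    """Bitrate of a coded video in bits per pixel."""
    return 8.0 * total_bytes / (num_frames * height * width)

def warp_backward(prev, flow):
    """Nearest-neighbour backward warp.

    flow[y, x] = (dx, dy) points from pixel (x, y) in the current frame to its
    source location in `prev`; out-of-range sources are clamped to the border.
    """
    h, w = prev.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev[src_y, src_x]

def warp_error(frames, flows):
    """Per-frame MSE between frame t and the flow-warped frame t-1 (assumed definition)."""
    errors = []
    for t in range(1, len(frames)):
        warped = warp_backward(frames[t - 1], flows[t - 1])
        errors.append(float(np.mean((frames[t] - warped) ** 2)))
    return errors

# A 96-frame 1920x1080 clip coded in 248,832 bytes sits at about 0.01 bpp.
bpp = bits_per_pixel(248_832, num_frames=96, height=1080, width=1920)

# Toy flicker demo: a static scene (zero flow) with and without per-frame noise.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
static = np.stack([image] * 5)                              # perfectly stable video
flicker = static + 0.1 * rng.standard_normal(static.shape)  # unstable details
flows = np.zeros((4, 8, 8, 2))                              # no motion between frames

stable_err = warp_error(static, flows)
flicker_err = warp_error(flicker, flows)
```

On the static clip every entry of `stable_err` is zero, while `flicker_err` is positive for every frame pair. In practice the flow would come from an optical flow estimator and the warp would typically use bilinear sampling with an occlusion mask.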