ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling

Daowen Li, Ruixiao Dong, Ying Chen, Kai Li, Ding Ding, Li Li

Abstract

Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, which limits bitrate reduction. Inspired by next-scale prediction in Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting only a coarse-to-fine subset of scales. A Transformer-based multi-scale autoregressive context model estimates token probabilities, which are used both for entropy coding of the transmitted tokens and for predicting the truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that, as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability.
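
The coarse-to-fine residual idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 1-D signal, average pooling, nearest-neighbour upsampling, and plain integer rounding in place of the learned quantizer, and all function names are hypothetical. It omits the context model entirely, so truncated scales are simply dropped rather than predicted at the decoder.

```python
def avg_pool(values, size):
    """Average-pool a 1-D list down to `size` entries (length must be divisible by size)."""
    step = len(values) // size
    return [sum(values[i * step:(i + 1) * step]) / step for i in range(size)]


def upsample(values, length):
    """Nearest-neighbour upsample a 1-D list to `length` entries."""
    step = length // len(values)
    return [v for v in values for _ in range(step)]


def encode(signal, scales):
    """Build one residual token map per scale, from coarse to fine."""
    recon = [0.0] * len(signal)
    token_maps = []
    for size in scales:
        # Quantize what the coarser scales have not yet explained.
        residual = [x - r for x, r in zip(signal, recon)]
        tokens = [round(v) for v in avg_pool(residual, size)]  # toy scalar quantizer
        token_maps.append(tokens)
        recon = [r + u for r, u in zip(recon, upsample(tokens, len(signal)))]
    return token_maps


def decode(token_maps, length):
    """Reconstruct from however many coarse-to-fine token maps were received."""
    recon = [0.0] * length
    for tokens in token_maps:
        recon = [r + u for r, u in zip(recon, upsample(tokens, length))]
    return recon


signal = [3, 5, 4, 8, 10, 9, 2, 1]
scales = [1, 2, 4, 8]
token_maps = encode(signal, scales)

# Progressive delivery: decode using only the first k scales.
errors = []
for k in range(1, len(scales) + 1):
    partial = decode(token_maps[:k], len(signal))
    errors.append(sum(abs(x - r) for x, r in zip(signal, partial)))
print(errors)
```

Sending fewer token maps lowers the rate at the cost of detail. In ProGVC, according to the abstract, the missing fine-scale tokens are predicted by the context model rather than left out as they are here.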

Paper Structure

This paper contains 24 sections, 9 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of the proposed ProGVC framework, a progressive generative video compression framework based on the visual autoregressive model. The left panel illustrates the output pyramid structure of the multi-scale residual quantization; the right panel shows the sparse attention mask used in the context model.
  • Figure 2: Overview of the multi-scale autoregressive context model. The Transformer network predicts both intra-scale and inter-scale tokens in an autoregressive manner. The resulting probability distributions are then used for entropy coding and for sampling discarded tokens during reconstruction.
  • Figure 3: Illustration of three attention mask designs: VAR-like attention (left), InfinityStar attention (middle), and our sparse attention (right). For simplicity of visualization, the masks are shown with three spatial scales (1, 2, 3) and a temporal length of $T=2$ for inter-scale tokens.
  • Figure 4: Rate-perception and rate-fidelity curves on the Xiph, HEVC B, and MCL-JCV datasets.
  • Figure 5: Visual comparisons with baselines on the Kimono sequence in the Xiph dataset and the ParkScene sequence in the HEVC B dataset.
  • ...and 3 more figures
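
As a rough illustration of the VAR-like mask named in the Figure 3 caption, the sketch below builds a block-wise causal mask in which every token may attend to all tokens at its own scale and at coarser scales. It assumes a single frame with square token maps of side 1, 2, and 3, and the function name is hypothetical. It does not reproduce the InfinityStar mask or the paper's sparse mask, whose exact patterns are not given in this excerpt.

```python
def block_causal_mask(scale_sides):
    """Return mask[q][k] == True when query token q may attend to key token k."""
    # Scale index of every token in the flattened coarse-to-fine sequence.
    scale_of_token = [
        idx for idx, side in enumerate(scale_sides) for _ in range(side * side)
    ]
    return [
        [key_scale <= query_scale for key_scale in scale_of_token]
        for query_scale in scale_of_token
    ]


mask = block_causal_mask([1, 2, 3])

# Print the mask with '#' for allowed attention and '.' for masked positions.
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

Each printed row is one query token. The allowed region grows in blocks as the scale increases, which is what allows all tokens of a scale to be predicted in parallel once the coarser scales are available.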