Generative Latent Coding for Ultra-Low Bitrate Image and Video Compression

Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

TL;DR

GLC performs transform coding in the generative latent space of a discrete VQ-VAE to enable ultra-low bitrate image and video compression. It adds a rate-variable latent transformation, a spatial categorical hyper module for images, a spatio-temporal categorical hyper module for video, and a code-prediction-based training objective that guides latent semantics. For images, GLC-image operates below $0.04$ bpp and matches the FID of the prior state of the art (MS-ILLM) while using about $45\%$ fewer bits. For video, GLC-video achieves a $65.3\%$ bitrate saving over PLVC in terms of DISTS. The approach supports variable bitrate in a single framework by exploiting a perceptually aligned latent representation and, for video, spatio-temporal semantic context.

Abstract

Most existing approaches to image and video compression perform transform coding in the pixel space to reduce redundancy. However, because pixel-space distortion is misaligned with human perception, such schemes often have difficulty achieving both high realism and high fidelity at ultra-low bitrates. To solve this problem, we propose Generative Latent Coding (GLC) models for image and video compression, termed GLC-image and GLC-video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel space, such a latent space offers greater sparsity, richer semantics, and better alignment with human perception, and it shows advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, a code-prediction-based loss function is proposed to enhance semantic consistency. Experiments demonstrate that our scheme achieves high visual quality at ultra-low bitrates for both image and video compression. For image compression, GLC-image operates at a bitrate of less than $0.04$ bpp, achieving the same FID as the previous SOTA model MS-ILLM while using $45\%$ fewer bits on the CLIC 2020 test set. For video compression, GLC-video achieves a $65.3\%$ bitrate saving over PLVC in terms of DISTS.
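
To make the quoted rates concrete, the arithmetic behind "bpp" and "percent fewer bits" can be sketched as below. The image resolution is a hypothetical example, not a figure from the paper, and the implied baseline rate assumes the $45\%$ saving applies exactly at the $0.04$ bpp operating point, which the abstract does not state precisely.

```python
def bits_for_image(width: int, height: int, bpp: float) -> float:
    """Total coded bits for an image compressed at `bpp` bits per pixel."""
    return width * height * bpp


def baseline_bpp_from_saving(bpp: float, saving: float) -> float:
    """Rate a baseline would need if `bpp` represents a `saving` fraction fewer bits."""
    return bpp / (1.0 - saving)


# Hypothetical 2048x1365 image (resolution chosen only for illustration).
bits = bits_for_image(2048, 1365, 0.04)
kilobytes = bits / 8 / 1000

# Assumption: the 45% saving is measured at the 0.04 bpp operating point.
implied_baseline = baseline_bpp_from_saving(0.04, 0.45)

print(f"coded size at 0.04 bpp: {kilobytes:.2f} kB")
print(f"implied baseline rate: {implied_baseline:.4f} bpp")
```

Under these assumptions, an image of roughly 2.8 megapixels fits in about 14 kB, and a codec needing 45% more of the budget saved would sit near 0.073 bpp for the same FID.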

Paper Structure

This paper contains 25 sections, 7 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: For ultra-low bitrates, the generative latent space of VQ-VAE provides a better alignment with human perception than the pixel space. At comparable distortion levels, as measured by signal-to-noise ratio (SNR), latent-space compression yields reconstructions with superior perceptual quality compared to the pixel-space generative codec MS-ILLM [muckley2023improving]. The perceptual enhancement is quantified using the DISTS metric [ding2020image].
  • Figure 2: Left: Comparison with previous methods. Unlike traditional approaches that perform transform coding in the pixel space, our scheme operates in the generative latent space. The generative latent coding pipeline involves three steps: (1) encoding the input into a generative latent space, (2) compressing the latents using transform coding, and (3) decoding the compressed latent to reconstruct the image. Center: Illustration of the proposed GLC-image for image compression. Right: Illustration of the proposed GLC-video for video compression.
  • Figure 3: Illustration of transform coding in the latent space of GLC-image, and comparison with other coding schemes as operational diagrams. (a) The model structure of the transform coding module in GLC-image. (b) Indices-map coding [mao2023extreme, jiang2023adaptive]. (c) Transform coding with a factorized hyper module [balle2018variational]. (d) The proposed transform coding with our spatial categorical hyper module. Here, AE and AD denote arithmetic encoding and decoding, VQ-E and VQ-D refer to VQ-indices-map encoding and decoding, Q represents scalar quantization, U signifies the addition of uniform noise as a differentiable approximation of Q, and S denotes the spatial context entropy module.
  • Figure 4: Visual comparison of the spatial categorical hyper module and the factorized hyper module. The bits-per-pixel (bpp) values for coding $y_0$ and $z_0$, and the bpp multiplier relative to our method, are shown. The proposed spatial categorical hyper module encodes essential semantic and structural information with significantly fewer bits while being less susceptible to low-level noise, thereby achieving comparable visual quality and substantially reducing the overall bit cost.
  • Figure 5: Our proposed spatio-temporal categorical hyper module in video compression. The token generation and token fusion modules are also illustrated.
  • ...and 12 more figures
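
The Q and U operations named in the Figure 3 caption can be sketched as follows. This is a minimal illustration of the common practice in learned compression, where hard rounding is used at inference and additive noise from U(-0.5, 0.5) stands in for it during training. The function names and sample values are hypothetical, and GLC's actual implementation may differ.

```python
import random
from typing import List


def quantize(y: List[float]) -> List[float]:
    """Q: hard scalar quantization to the nearest integer (inference time).

    Python's round() resolves exact .5 ties to the nearest even integer.
    """
    return [float(round(v)) for v in y]


def add_uniform_noise(y: List[float], rng: random.Random) -> List[float]:
    """U: training-time proxy for Q, adding noise drawn from U(-0.5, 0.5).

    Unlike rounding, this keeps the output a smooth function of the input,
    so gradients can pass through during training.
    """
    return [v + rng.uniform(-0.5, 0.5) for v in y]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
latent = [0.2, 1.7, -2.4, 3.9]       # hypothetical latent values
hard = quantize(latent)              # what the arithmetic coder would see
soft = add_uniform_noise(latent, rng)  # what the training loss would see

print("Q:", hard)
print("U:", soft)
```

Both operations perturb each value by at most 0.5, which is why the noisy version is a reasonable training-time stand-in for the rounded one.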