ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

TL;DR

ProGIC is a compact codec built on residual vector quantization (RVQ), paired with a lightweight backbone of depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices.

Abstract

Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GIC methods rely on large, rigid models, which severely limits their usefulness for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers, each with its own codebook, encodes the residual left by the preceding stages. The selected codewords sum to a coarse-to-fine reconstruction and form a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC supports progressive transmission for flexibility and delivers over 10 times faster encoding and decoding than MS-ILLM on GPUs.
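
The RVQ mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the codebooks are random toy values, the names `rvq_encode`, `rvq_decode`, and `nearest` are hypothetical, and each toy codebook includes a zero codeword (an assumption made here so that adding a stage can never increase the error). In ProGIC the codebooks are learned and operate on latent features rather than raw vectors.

```python
import random

def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        # Subtract the chosen codeword so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of indices yields a coarser preview."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

def sq_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Toy setup (assumed values): 3 stages, 16 codewords each, 4-dim vectors.
rng = random.Random(0)
dim, size, stages = 4, 16, 3
codebooks = []
for s in range(stages):
    # Later stages use smaller codewords, mimicking finer residuals.
    cb = [[rng.gauss(0.0, 1.0 / 2 ** s) for _ in range(dim)] for _ in range(size - 1)]
    cb.append([0.0] * dim)  # zero codeword: a stage may leave the residual unchanged
    codebooks.append(cb)

x = [0.3, -1.2, 0.8, 0.1]
idx = rvq_encode(x, codebooks)
# Decode from 0, 1, 2, 3 received stages: a progressive, coarse-to-fine preview.
errs = [sq_error(x, rvq_decode(idx[:k], codebooks)) for k in range(stages + 1)]
print(idx, errs)
```

Decoding from `idx[:k]` corresponds to receiving only the first `k` parts of the bitstream; the reconstruction error in `errs` does not increase as more stages arrive.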

Paper Structure (42 sections, 8 equations, 23 figures, 6 tables)


Figures (23)

  • Figure 1: BD-rate vs. Decoding Latency on the Kodak dataset measured with DISTS on one NVIDIA A100 GPU. The proposed ProGIC attains competitive BD-rate while substantially reducing decoding latency. Upper-left indicates better.
  • Figure 2: Conceptual illustration of the motivation behind ProGIC. The original image vector is approximated by a base vector plus a sequence of residual vectors, yielding progressively improved reconstructions.
  • Figure 3: (a) Overview of the proposed ProGIC. Each down-/up-sampling stage consists of a stack of $M$ depthwise convolution blocks and a feed-forward network (FFN). The blocks in $g_s(\cdot)$ are modified with feature modulation, as described in Section 3.2. (b) Depthwise convolution block. “Depth conv” denotes a depthwise convolution, while others are pointwise convolutions. (c) FFN architecture, where “Chunk-2” splits the tensor into two equal parts along the channel dimension.
  • Figure 4: Feature modulation in an FFN: at each progressive decoding stage, stage-specific scale and bias are applied to the features before the residual addition.
  • Figure 5: Rate-distortion performance on the Kodak, Tecnick, DIV2K, and CLIC2020-Professional datasets, evaluated with LPIPS and DISTS vs. BPP. Curves closer to the lower-left are better, indicating better quality at the same compression ratio. "OOM" denotes out-of-memory under the official evaluation environment.
  • ...and 18 more figures
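
The feature modulation described in the Figure 4 caption can be sketched as follows. This is one reading of the caption, not the paper's code: the FFN body is replaced by a hypothetical elementwise ReLU stand-in (`ffn_branch`), the scale and bias values are made up, and the function name `modulated_block` is an assumption. The sketch only shows the order of operations: the branch output is scaled and shifted with stage-specific parameters, and the skip connection is added afterwards.

```python
def ffn_branch(x):
    """Hypothetical stand-in for the FFN body (elementwise ReLU)."""
    return [max(0.0, v) for v in x]

def modulated_block(x, stage, scales, biases, branch=ffn_branch):
    """Compute x + (scale_k * branch(x) + bias_k) for decoding stage k."""
    h = branch(x)
    # Stage-specific modulation is applied before the residual addition.
    h = [s * v + b for v, s, b in zip(h, scales[stage], biases[stage])]
    return [xi + hi for xi, hi in zip(x, h)]

# Assumed per-stage parameters for a 3-channel feature (illustrative values only).
scales = [[1.0, 1.0, 1.0], [0.5, 2.0, 1.0]]
biases = [[0.0, 0.0, 0.0], [0.1, -0.1, 0.0]]

feat = [1.0, -2.0, 3.0]
y0 = modulated_block(feat, 0, scales, biases)  # identity modulation at stage 0
y1 = modulated_block(feat, 1, scales, biases)  # different scale and bias at stage 1
print(y0, y1)
```

Because only the scale and bias depend on the stage index, the same block weights can be reused across progressive decoding stages, with a small set of extra parameters per stage.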