Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer

Naifu Xue, Qi Mao, Zijian Wang, Yuan Zhang, Siwei Ma

TL;DR

This work tackles ultra-low bitrate image coding by addressing the mismatch between prior modeling and reconstruction in existing generative codecs. It introduces Unified Image Generation-Compression (UIGC), which tokenizes images with a Vector-Quantized Image Modeling (VIM) approach, uses a Multi-Stage Transformer (MST) to learn a powerful prior, and applies an edge-preserved checkerboard mask to discard nonessential tokens. The decoder regenerates lost tokens from the learned prior and reconstructs high-quality images using a VQGAN-based decoder, achieving superior perceptual quality at ultra-low bitrates (e.g., ≤0.03 bpp) on Kodak and CLIC datasets and outperforming traditional codecs and prior neural baselines. This approach demonstrates a significant step toward jointly optimizing generation and compression, suggesting a new direction for generative image compression.

Abstract

Recent progress in generative compression technology has significantly improved the perceptual quality of compressed data. However, these advancements primarily focus on producing high-frequency details, often overlooking the ability of generative models to capture the prior distribution of image content, thus impeding further bitrate reduction in extreme compression scenarios (<0.05 bpp). Motivated by the capabilities of predictive language models for lossless compression, this paper introduces a novel Unified Image Generation-Compression (UIGC) paradigm, merging the processes of generation and compression. A key feature of the UIGC framework is the adoption of vector-quantized (VQ) image models for tokenization, alongside a multi-stage transformer designed to exploit spatial contextual information for modeling the prior distribution. As such, the dual-purpose framework effectively utilizes the learned prior for entropy estimation and assists in the regeneration of lost tokens. Extensive experiments demonstrate the superiority of the proposed UIGC framework over existing codecs in perceptual quality and human perception, particularly in ultra-low bitrate scenarios (≤0.03 bpp), pioneering a new direction in generative compression.
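
To make the "learned prior for entropy estimation" idea concrete, here is a minimal sketch of how a predicted token distribution translates into a bit cost. It is not the authors' implementation: the function names, the toy probabilities, the 4-entry codebook, and the codebook size and token footprint in the last lines are all illustrative assumptions, not values taken from the paper. An ideal arithmetic coder spends about -log2 p(token) bits per token, so a sharper prior means fewer bits.

```python
import math


def estimated_bits(token_ids, probs):
    """Ideal code length, in bits, of a token sequence under a prior.

    probs[i] is the prior's predicted distribution over the codebook at
    position i; token_ids[i] is the token that actually occurred there.
    """
    return sum(-math.log2(probs[i][t]) for i, t in enumerate(token_ids))


def bits_per_pixel(total_bits, height, width):
    """Convert a total bit count to bits per pixel (bpp)."""
    return total_bits / (height * width)


# Toy example with a hypothetical 4-entry codebook and 3 tokens.
tokens = [0, 2, 1]
probs = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
    [0.125, 0.5, 0.25, 0.125],
]
bits = estimated_bits(tokens, probs)  # 1 + 2 + 1 bits

# Assumed numbers (not from the paper): a 1024-entry codebook coded with a
# uniform prior costs log2(1024) = 10 bits per token; if each token covers
# a 16x16 pixel patch, that is 10 / 256 bpp.
uniform_bpp = bits_per_pixel(math.log2(1024), 16, 16)
print(bits, uniform_bpp)
```

Under these assumed numbers a uniform prior already costs about 0.039 bpp, which suggests why reaching ≤0.03 bpp calls for both a stronger prior and discarding tokens that the decoder can regenerate.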

Paper Structure

This paper contains 10 sections, 3 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Qualitative comparisons between state-of-the-art image compression methods, including the traditional VVC codec, the learning-based method of Cheng et al. (CVPR 2020), generative methods [NEURIPS2020_8a50bae2, mao2023extreme], and ours.
  • Figure 2: Overview of the proposed UIGC framework. (a) the overall compression workflow: we adopt a multi-stage transformer, and AE/AD denote arithmetic encoder/decoder; (b) the mask mechanism using mask module $\boldsymbol{M}$, and $\lor$ denotes the logical "OR" operator; (c) entropy decoding and token generation on the decoder side.
  • Figure 3: Our MST. We implement a GPT-style transformer for the prior modeling at each stage.
  • Figure 4: Qualitative comparisons on the Kodak dataset [kodak]. ↓ indicates that lower is better.
  • Figure 5: Rate-distortion (R-D) curves on the Kodak [kodak] and CLIC [CLIC2020] datasets. "Ours w/o lost" denotes our method without token loss and regeneration.
  • ...and 5 more figures
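
The mask mechanism in Figure 2(b) combines masks with a logical OR. The sketch below shows one plausible reading: a token is kept if it lies on a checkerboard position or on a detected edge, and all other tokens are dropped for the decoder to regenerate. The helper names, the 4x4 token grid, the edge map, and the choice of which checkerboard parity is kept are hypothetical and not taken from the paper.

```python
def checkerboard(rows, cols):
    """Boolean grid that is True where (row + col) is even."""
    return [[(r + c) % 2 == 0 for c in range(cols)] for r in range(rows)]


def combine_or(mask_a, mask_b):
    """Element-wise logical OR of two equally sized boolean grids."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Hypothetical edge map on a 4x4 token grid (True = token lies on an edge).
edge = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, False, False, False],
    [False, False, False, False],
]

# Keep a token if it is on the checkerboard OR on an edge.
keep = combine_or(checkerboard(4, 4), edge)
kept = sum(v for row in keep for v in row)
print(kept, "of 16 tokens kept")
```

In this toy grid the checkerboard alone keeps 8 tokens, and the edge map adds one more at position (1, 2), so edge tokens are transmitted even where the checkerboard would have dropped them.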