Towards Extreme Image Compression with Latent Feature Guidance and Diffusion Prior

Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, Jingwen Jiang

TL;DR

This work proposes a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates.

Abstract

Image compression at extremely low bitrates (below 0.1 bits per pixel, bpp) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages a pre-trained Stable Diffusion model to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the Stable Diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates. The source code and trained models are available at https://github.com/huai-chang/DiffEIC.
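
To give a sense of scale for the 0.1 bpp threshold mentioned in the abstract, the following minimal sketch computes the bit budget for a single image. The 768×512 resolution (the size of the Kodak test images) and the helper name `bit_budget` are assumptions chosen for illustration; they are not taken from the paper's code.

```python
def bit_budget(width: int, height: int, bpp: float) -> float:
    """Total number of bits available for an image at a given bitrate."""
    return width * height * bpp


# Assumed example size: 768x512 pixels.
width, height = 768, 512

bits = bit_budget(width, height, 0.1)  # roughly 39,322 bits
size_bytes = bits / 8                  # roughly 4.9 kB

# Uncompressed 8-bit RGB uses 24 bits per pixel, so 0.1 bpp corresponds
# to a compression ratio of about 240:1.
raw_bits = width * height * 24
ratio = raw_bits / bits

print(f"{bits:.0f} bits, {size_bytes:.0f} bytes, about {ratio:.0f}:1")
```

At this budget the whole image must fit in under 5 kB, which is why the reconstruction has to rely heavily on a generative prior rather than on transmitted detail.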

Paper Structure

This paper contains 34 sections, 14 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Visual examples of the reconstructed results on the Kodak dataset. The proposed DiffEIC produces much better results in terms of perception and fidelity. For example, the small attic is well reconstructed.
  • Figure 2: The two-stage pipeline of the proposed DiffEIC. Image Compression: Initially, we leverage the VAE-based latent feature-guided compression module (LFGCM) to adaptively select information essential for reconstruction and obtain $z_c$. Image Reconstruction: We leverage the conditional diffusion decoding module (CDDM) for realistic image reconstruction and obtain $\hat{x}$. The CDDM contains a trainable control module and a fixed noise estimator. Note that the control module and noise estimator are connected with zero convolutions (zero-initialized convolution layers).
  • Figure 3: The architecture of the proposed LFGCM. $Conv\ k3s1$ denotes convolution with $3\times 3$ filters and stride 1. $Tconv\ k3s1$ denotes transposed convolution with $3\times 3$ filters and stride 1. $RB$ denotes a residual block (ResNet). $RBneck$ denotes a residual bottleneck block (ResNet). $LReLU$ denotes the LeakyReLU function. $AE$ and $AD$ denote the arithmetic encoder and decoder, respectively. $C_m$ denotes the context model. $Fusion$ denotes the fusion method. The black and red arrows denote the main and guidance flows, respectively.
  • Figure 4: Quantitative comparisons with state-of-the-art methods in terms of perceptual quality (LPIPS$\downarrow$ / NIQE$\downarrow$ / DISTS$\downarrow$ / FID$\downarrow$ / KID$\downarrow$) on the Kodak, Tecnick, and CLIC2020 datasets.
  • Figure 5: Quantitative comparisons with state-of-the-art methods in terms of pixel fidelity (MS-SSIM$\uparrow$ / PSNR$\uparrow$) on the Kodak, Tecnick, and CLIC2020 datasets.
  • ...and 7 more figures
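
The Figure 2 caption notes that the control module and the fixed noise estimator are connected with zero convolutions. The sketch below illustrates why zero initialization is useful: at the start of training the injected signal is exactly zero, so the frozen model's features are left untouched. This is a toy illustration in plain Python, not the authors' implementation. The 1×1 kernel size, the additive injection, the helper names `conv1x1` and `zero_conv_params`, and all feature values are assumptions.

```python
def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution to a feature map stored as [channel][pixel].

    weight[o][i] maps input channel i to output channel o; bias[o] is
    added to every pixel of output channel o.
    """
    n_pixels = len(features[0])
    return [
        [
            sum(weight[o][i] * features[i][p] for i in range(len(features)))
            + bias[o]
            for p in range(n_pixels)
        ]
        for o in range(len(weight))
    ]


def zero_conv_params(out_channels, in_channels):
    """Zero-initialized weights and biases (hypothetical helper)."""
    weight = [[0.0] * in_channels for _ in range(out_channels)]
    bias = [0.0] * out_channels
    return weight, bias


# Toy feature maps with 2 channels and 3 pixels each (made-up values).
frozen_features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]   # fixed noise estimator
control_features = [[3.0, 1.0, -2.0], [0.25, 4.0, 1.0]]  # trainable control module

weight, bias = zero_conv_params(out_channels=2, in_channels=2)
injected = conv1x1(control_features, weight, bias)

# Assumed injection rule: the control signal is added to the frozen features.
combined = [
    [f + c for f, c in zip(f_row, c_row)]
    for f_row, c_row in zip(frozen_features, injected)
]

# At initialization the injected term is all zeros, so the frozen
# features pass through unchanged.
assert combined == frozen_features
```

Once training updates the weights away from zero, the control features begin to influence the frozen model's activations gradually, which is consistent with the stated goal of preserving the pre-trained model's generative capability.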