High-Fidelity Image Compression with Score-based Generative Models

Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, Lucas Theis

TL;DR

This work shows that score-based generative models, when carefully tuned for image compression, can surpass GAN-based baselines in perceptual realism at low bit-rates. The authors propose a two-stage pipeline: a pretrained MSE-optimized autoencoder (ELIC) followed by a diffusion- or flow-based decoder that refines the reconstruction without adding to the bit-rate. Key design choices include a reduced-noise diffusion schedule focused on high-frequency details, v-prediction for stability, and patchwise, parallel sampling to handle high-resolution images. On datasets such as CLIC20 and MS-COCO 30k, the approach achieves state-of-the-art rate-FID results, trading some distortion for higher realism and establishing diffusion-based compression as a competitive direction. The work also analyzes practical considerations, such as sampling efficiency and the impact of realism bias, and points to faster diffusion sampling techniques as a direction for future improvement.
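
As a rough illustration of the v-prediction parameterization mentioned above, the following minimal sketch uses the standard definition $v = \alpha_t \epsilon - \sigma_t x$ for a variance-preserving process $z_t = \alpha_t x + \sigma_t \epsilon$ with $\alpha_t^2 + \sigma_t^2 = 1$. The function names and the cosine schedule are assumptions made for illustration only; they are not taken from the paper, whose reduced-noise schedule differs.

```python
import numpy as np

def v_target(x, eps, alpha, sigma):
    """Training target under v-prediction: v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x

def x_from_v(z, v, alpha, sigma):
    """Recover the clean image estimate from z_t and v (requires alpha^2 + sigma^2 = 1)."""
    return alpha * z - sigma * v

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))    # stand-in for a clean image patch
eps = rng.standard_normal((8, 8))  # Gaussian noise

# Assumed cosine schedule (illustrative; not the paper's reduced-noise schedule).
t = 0.3
alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

z = alpha * x + sigma * eps        # noised input z_t
v = v_target(x, eps, alpha, sigma)
x_rec = x_from_v(z, v, alpha, sigma)
```

With the exact target `v`, `x_rec` matches `x` up to floating-point error; in practice a network would predict `v` from `z`, the timestep, and the autoencoder output.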

Abstract

Despite the tremendous success of diffusion generative models in text-to-image generation, replicating this success in the domain of image compression has proven difficult. In this paper, we demonstrate that diffusion can significantly improve perceptual quality at a given bit-rate, outperforming state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is achieved using a simple but theoretically motivated two-stage approach combining an autoencoder targeting MSE followed by a further score-based decoder. However, as we will show, implementation details matter and the optimal design decisions can differ greatly from typical text-to-image models.

Paper Structure (30 sections, 13 equations, 19 figures, 2 tables)


Figures (19)

  • Figure 1:
  • Figure 2:
  • Figure 3:
  • Figure 4: Overview of our high-fidelity diffusion (HFD) approach. The output of a standard MSE autoencoder is used by a denoising diffusion model to produce realistic samples by iteratively denoising for $\mathrm{T}$ steps.
  • Figure 5:
  • ...and 14 more figures