
Towards image compression with perfect realism at ultra-low bitrates

Marlène Careil, Matthew J. Muckley, Jakob Verbeek, Stéphane Lathuilière

TL;DR

This work tackles the challenge of achieving realistic image reconstructions at ultra-low bitrates. It introduces PerCo, a perceptual compression framework that uses a diffusion-based decoder conditioned on a vector-quantized local latent $\mathbf{z}_l$ and a global caption embedding $\mathbf{z}_g$, with classifier-free guidance to balance realism and fidelity. The approach achieves state-of-the-art perceptual metrics (FID, KID) at rates as low as $0.003$ bits per pixel on Kodak and shows robustness across bitrates with improvements in CLIP and mIoU, while highlighting the well-known tradeoffs between distortion and perceptual realism. These results suggest a viable path toward perfect realism in neural image compression, albeit with limitations related to resolution and decoding speed that will guide future research toward scalable, patch-based solutions and deeper bias analyses.

Abstract

Image codecs are typically optimized to trade off bitrate vs. distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality and remove dependency on the bitrate, we propose to decode with iterative diffusion models. We condition the decoding process on a vector-quantized image representation, as well as a global image description to provide additional context. We dub our model PerCo for 'perceptual compression', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is more than an order of magnitude smaller than those considered in most prior work, compressing a 512×768 Kodak image with less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID. As predicted by rate-distortion-perception theory, visual quality is less dependent on the bitrate than for previous methods.
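The byte count quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the helper names are hypothetical and not from the paper, and it assumes the rate is measured as total bits divided by the number of pixels.

```python
def bpp_to_bytes(bpp: float, height: int, width: int) -> float:
    """Convert a rate in bits per pixel to a total size in bytes."""
    return bpp * height * width / 8.0


def bytes_to_bpp(num_bytes: float, height: int, width: int) -> float:
    """Convert a total size in bytes to a rate in bits per pixel."""
    return num_bytes * 8.0 / (height * width)


# A Kodak image has 512 x 768 = 393,216 pixels.
kodak_bytes = bpp_to_bytes(0.003, 512, 768)   # about 147.5 bytes
bound_bpp = bytes_to_bpp(153, 512, 768)       # about 0.0031 bpp

print(f"0.003 bpp -> {kodak_bytes:.1f} bytes")
print(f"153 bytes -> {bound_bpp:.4f} bpp")
```

At a nominal 0.003 bpp the image occupies roughly 147 bytes, which is consistent with the stated bound of less than 153 bytes; the quoted rate is presumably rounded.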

Paper Structure (14 sections, 5 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Kodak images compressed with the diffusion-based Text-Sketch approach PICS [lei23ncw], the hand-crafted codec VTM [vtm], MS-ILLM [muckley2023improving], which leverages an adversarial loss, and PerCo (ours). The lowest available bitrate is used for each method.
  • Figure 2: Overview of PerCo. The LDM encoder maps an RGB image into the latent space of the diffusion model. The "hyper encoder" then maps the image to a hyper-latent with smaller spatial resolution, which is then vector-quantized and represented as a bitstream using uniform coding. The image captioning model generates a textual description of the input image, which is losslessly compressed, and processed by a text encoder to condition the diffusion model. The diffusion model reconstructs the input image in its latent space conditioned on the output of the text encoder and the hyper encoder. Finally, the LDM decoder maps the latent reconstruction back to RGB pixel space.
  • Figure 3: Evaluation of PerCo and other image compression codecs on Kodak and MS-COCO 30k.
  • Figure 4: Comparisons on COCO at $256\!\times\!256$ resolution.
  • Figure 5: Comparing PerCo on images from the Kodak dataset to MS-ILLM [muckley2023improving], which leverages an adversarial loss, and the diffusion-based Text-Sketch approach [lei23ncw] conditioned on text only (top, PIC, 0.0025 bpp) and on text + sketch (bottom, PICS, 0.028 bpp).
  • ...and 11 more figures
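
The Figure 2 caption describes a hyper-latent that is vector-quantized and written to a bitstream with uniform coding. A minimal sketch of that idea follows, assuming nearest-neighbour quantization under squared L2 distance and a fixed-length binary code per index. The function names, the toy codebook, and the 2-D vectors are hypothetical and chosen for illustration; they are not the paper's implementation or its actual codebook sizes.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def bits_per_index(codebook_size):
    """Fixed code length in bits for a codebook with at least two entries."""
    return (codebook_size - 1).bit_length()


def pack_indices(indices, codebook_size):
    """Uniform coding: every index gets a binary code of the same length."""
    width = bits_per_index(codebook_size)
    return "".join(format(i, f"0{width}b") for i in indices)


# Toy example: four 2-D latent vectors and a codebook with four entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.8, 0.9)]

indices = quantize(latents, codebook)
bitstream = pack_indices(indices, len(codebook))
total_bits = len(bitstream)

print(indices, bitstream, total_bits)
```

With uniform coding the cost is simply the number of quantized positions times the per-index code length, so under these assumptions the local bitrate is set by the spatial size of the hyper-latent and the codebook size rather than by image content.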