High-Fidelity Generative Image Compression

Fabian Mentzer, George Toderici, Michael Tschannen, Eirikur Agustsson

TL;DR

HiFiC advances neural image compression by integrating a conditional GAN with a rate-distortion-perception framework to produce visually faithful reconstructions at high resolutions. The approach combines a learned hyperprior, a single-scale conditional discriminator, and ChannelNorm normalization layers to stabilize training and preserve textures across diverse datasets. In a perceptual evaluation (FID, KID, LPIPS, NIQE) and a large user study, HiFiC achieves better perceptual quality at practical bitrates than BPG and MSE-based baselines, at the cost of lower PSNR. The work also reports ablation studies on normalization, discriminator conditioning, generator capacity, and training stability, offering design guidance for perceptual neural compression and pointing to future directions in perceptual metrics and video compression.

Abstract

We extensively study how to combine Generative Adversarial Networks and learned compression to obtain a state-of-the-art generative lossy compression system. In particular, we investigate normalization layers, generator and discriminator architectures, training strategies, as well as perceptual losses. In contrast to previous work, i) we obtain visually pleasing reconstructions that are perceptually similar to the input, ii) we operate in a broad range of bitrates, and iii) our approach can be applied to high-resolution images. We bridge the gap between rate-distortion-perception theory and practice by evaluating our approach both quantitatively with various perceptual metrics, and with a user study. The study shows that our method is preferred to previous approaches even if they use more than 2x the bitrate.

Paper Structure

This paper contains 43 sections, 6 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Comparing our method, HiFiC, to the original, as well as BPG at a similar bitrate and at $2{\times}$ the bitrate. We can see that our GAN model produces a high-fidelity reconstruction that is very close to the input, while BPG exhibits blocking artifacts that are still present at double the bitrate. In the background, we show a split to the original to further indicate how close our reconstruction is. We show many more visual comparisons in Appendix \ref{sec:supp:morevisuals}, including more methods, more bitrates, and various datasets. Best viewed on screen.
  • Figure 2: Our architecture. ConvC is a convolution with $C$ channels, with $3{\times}3$ filters, except when denoted otherwise. $\downarrow2$, $\uparrow2$ indicate strided down or up convolutions. Norm is ChannelNorm (see text), LReLU the leaky ReLU \cite{xu2015empirical} with $\alpha{=}0.2$, NN${\uparrow}16$ nearest neighbor upsampling, $Q$ quantization.
  • Figure 3: Normalized scores for the user study, compared to perceptual metrics. We invert human scores such that lower is better for all. Below each method, we show average bpp, and for learned methods we show the loss components. "no GAN" is our baseline, using the same architecture and distortion $d$ as HiFiC (Ours), but no GAN. "M&S" is the Mean & Scale Hyperprior MSE-optimized baseline. The study shows that training with a GAN yields reconstructions that outperform BPG at practical bitrates, for high-resolution images. Our model at 0.237 bpp is preferred to BPG even if BPG uses $2.1{\times}$ the bitrate, and to MSE-optimized models even if they use $1.7{\times}$ the bitrate.
  • Figure 4: Rate-distortion and -perception curves on CLIC2020. Arrows in the title indicate whether lower is better ($\downarrow$), or higher is better ($\uparrow$). Methods are described in Section \ref{sec:evaluation}.
  • Figure 5: Distortion-perception trade-off.
  • ...and 16 more figures
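
The bitrate comparison in the Figure 3 caption can be made concrete with a little arithmetic. The sketch below converts between file size and bits per pixel (bpp) and computes the bitrates implied by the $2.1{\times}$ and $1.7{\times}$ factors. The function names and the 2000x1500 image size are assumptions chosen for illustration; they do not come from the paper.

```python
def bits_per_pixel(num_bytes: float, height: int, width: int) -> float:
    """Bitrate in bits per pixel for a compressed file of num_bytes bytes."""
    return 8.0 * num_bytes / (height * width)


def bytes_at_bpp(bpp: float, height: int, width: int) -> float:
    """Compressed size in bytes for an image coded at the given bpp."""
    return bpp * height * width / 8.0


# Average bitrate of the HiFiC model quoted in the Figure 3 caption.
hific_bpp = 0.237

# Bitrates implied by the caption's factors.
bpg_bpp = 2.1 * hific_bpp  # BPG at 2.1x the bitrate
mse_bpp = 1.7 * hific_bpp  # MSE-optimized models at 1.7x the bitrate

# Hypothetical image size, used only to turn bpp into a file size.
height, width = 1500, 2000
hific_bytes = bytes_at_bpp(hific_bpp, height, width)
bpg_bytes = bytes_at_bpp(bpg_bpp, height, width)

print(f"HiFiC: {hific_bpp:.3f} bpp, about {hific_bytes / 1000:.1f} kB")
print(f"BPG:   {bpg_bpp:.3f} bpp, about {bpg_bytes / 1000:.1f} kB")
print(f"MSE:   {mse_bpp:.3f} bpp")
```

Under these assumptions, the 0.237 bpp model corresponds to roughly 89 kB for a 2000x1500 image, while a codec at $2.1{\times}$ the bitrate would need roughly 187 kB for the same image.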