A Residual Diffusion Model for High Perceptual Quality Codec Augmentation

Noor Fathima Ghouse, Jens Petersen, Auke Wiggers, Tianlin Xu, Guillaume Sautière

TL;DR

DIRAC introduces a diffusion-based residual augmentation approach to image compression by coupling a strong base codec with a receiver-side residual diffusion model. The method models residuals conditioned on the base reconstruction, enabling a smooth, test-time traversal of the rate-distortion-perception tradeoff with efficient sampling (often 20 steps) through late-start sampling and rate-dependent thresholding. It applies to both generative compression (enhancing neural base codecs) and enhancement of standard codecs (JPEG, VTM), achieving competitive perceptual metrics (FID/256, LPIPS) while preserving PSNR, and providing practical tradeoff control. The approach demonstrates strong results on high-resolution datasets and offers a viable path for deploying perceptually rich reconstructions in real-world pipelines.
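The residual-augmentation idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The noise schedule, the DDIM-style deterministic update, the late-start initialization, and the names `make_schedule`, `enhance`, and `denoiser` are all assumptions introduced here for illustration, and rate-dependent thresholding is omitted. The sketch only shows the overall shape: start the reverse process partway through the schedule, run a small number of steps conditioned on the base reconstruction, and add the sampled residual back onto it.

```python
import numpy as np


def make_schedule(num_train_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule (an assumption; the summary does not specify one)."""
    betas = np.linspace(beta_min, beta_max, num_train_steps)
    return np.cumprod(1.0 - betas)  # cumulative alpha-bar values


def enhance(x_base, denoiser, alphas_bar, t_start=500, num_steps=20, seed=0):
    """Sample a residual conditioned on x_base and add it to x_base.

    `denoiser(r_t, x_base, t)` is a hypothetical stand-in for a trained
    network that predicts the noise contained in the noisy residual r_t.
    """
    rng = np.random.default_rng(seed)
    # Late start: begin at t_start < T instead of at the end of the schedule.
    timesteps = np.linspace(t_start, 0, num_steps).round().astype(int)
    # Assumed initialization: noise scaled as if the clean residual were zero.
    r = np.sqrt(1.0 - alphas_bar[timesteps[0]]) * rng.standard_normal(x_base.shape)
    for i, t in enumerate(timesteps):
        eps = denoiser(r, x_base, t)
        # Estimate of the clean residual r_0 implied by the noise prediction.
        r0 = (r - np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
        if i + 1 < len(timesteps):
            t_prev = timesteps[i + 1]
            # Deterministic DDIM-style move to the next, less noisy timestep.
            r = np.sqrt(alphas_bar[t_prev]) * r0 + np.sqrt(1.0 - alphas_bar[t_prev]) * eps
        else:
            r = r0
    return x_base + r  # enhanced reconstruction = base reconstruction + residual
```

In practice `denoiser` would be a trained conditional network and `x_base` the output of a real codec such as JPEG or a neural base codec. The values of `t_start` and `num_steps` are the knobs that control how much sampling work the receiver does.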

Abstract

Diffusion probabilistic models have recently achieved remarkable success in generating high quality image and video data. In this work, we build on this class of generative models and introduce a method for lossy compression of high resolution images. The resulting codec, which we call DIffusion-based Residual Augmentation Codec (DIRAC), is the first neural codec to allow smooth traversal of the rate-distortion-perception tradeoff at test time, while obtaining competitive performance with GAN-based methods in perceptual quality. Furthermore, while sampling from diffusion probabilistic models is notoriously expensive, we show that in the compression setting the number of steps can be drastically reduced.

Paper Structure

This paper contains 47 sections, 10 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Base reconstruction (left) and the DIRAC-enhanced version (right). Our model combines a base codec with a receiver-side enhancement model, and can smoothly interpolate between near-state-of-the-art fidelity (PSNR, higher is better) and near-state-of-the-art perceptual quality (FID, lower is better). For JPEG ($QF=5$) specifically, we achieve a drastic improvement in perceptual quality without loss in PSNR. Best viewed digitally. PSNR is measured on the shown example; FID/256 is measured on the full CLIC 2020 test dataset.
  • Figure 2: Overview of our architecture. Given an input image ${\mathbf{x}}$ and target rate factor $\lambda_{rate}$, we obtain a base codec reconstruction $\mathbf{\tilde{x}}$. Our DDPM is conditioned on $\mathbf{\tilde{x}}$ and learns to model a reverse diffusion process that generates residuals $r_0$ from sampled Gaussian noise latents $r_T$. The enhanced reconstruction $\mathbf{\hat{x}}$ is then obtained by adding the predicted residual to $\mathbf{\tilde{x}}$.
  • Figure 3: Rate-distortion (left) and rate-perception (right) curves for the CLIC 2020 test set (top) and Kodak dataset (bottom). The Kodak dataset has too few samples for FID/256 evaluation, so we instead evaluate LPIPS, a perceptual distortion metric.
  • Figure 4: CLIC 2020 test reconstructions comparing our model to MultiRealism (Agustsson et al., 2022). We show the original (top left), the SwinT-ChARM base codec (bottom left), DIRAC-1 (high fidelity) and DIRAC-100 (high perceptual quality) in the center column, and their MultiRealism counterparts in the right column. Shown scores are for the full image. Best viewed electronically.
  • Figure 5: CLIC 2020 test reconstruction by DIRAC-100 and MS-ILLM; crop location chosen based on Muckley et al. (2023).
  • ...and 11 more figures