Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder

Yiyang Ma, Wenhan Yang, Jiaying Liu

TL;DR

CorrDiff tackles the tension between distortion fidelity and perceptual quality in diffusion-based image compression by leveraging a privileged end-to-end decoder to correct the diffusion sampling using information extracted at the encoder. The method derives a correction to the diffusion score function conditioned on the visible original image, and approximates this correction via an external end-to-end decoder, transmitting only a few bits per time step. Through a two-phase training regime and DDIM-based inference, CorrDiff achieves improved performance across both distortion and perceptual metrics on standard datasets, outperforming prior diffusion- and perceptual-based methods. The approach offers a practical, bitrate-efficient path to high-fidelity, perceptually pleasing reconstructions in diffusion-driven compression applications.

Abstract

The images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee low distortion, so the integration of diffusion models and image compression models still needs more comprehensive exploration. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as a correction, which achieves better perceptual quality while keeping the distortion under control to an extent. We build a diffusion model and design a novel paradigm that combines the diffusion model with an end-to-end decoder, where the latter is responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of the diffusion model at the encoder side, where the original images are visible. Based on this analysis, we introduce an end-to-end convolutional decoder to provide a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmit the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.
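
For orientation, the identities below are the standard ones from the DDPM and score-based literature; they are not quoted from this paper. They illustrate why an encoder that can see the original image has access to a better score than the decoder does. The symbols $\bar{\alpha}_t$ (cumulative noise schedule), $\bm{\epsilon}$ (injected Gaussian noise), and $\bm{\epsilon}_\theta$ (learned noise predictor) are assumed notation.

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}, \qquad \bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

$$\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} = -\frac{\bm{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}$$

$$\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t) \approx -\frac{\bm{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

With $\mathbf{x}_0$ known, the second line can be evaluated exactly, whereas the third line is only a learned approximation. The gap between the two is the kind of quantity a correction term would aim to reduce.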

Paper Structure (20 sections, 5 theorems, 31 equations, 9 figures, 3 tables, 2 algorithms)

Key Result

Theorem 3.1

The conditional distribution $q_t(\mathbf{x}_0^*|\hat{\mathbf{y}}, \mathbf{x}_t)$ can be approximated by $q_t(\mathbf{x}_0^*|\hat{\mathbf{y}}, \hat{\mathbf{x}}_{0, t})$.
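
As a minimal sketch of the quantity this theorem refers to, assume $\hat{\mathbf{x}}_{0,t}$ denotes the usual clean-image estimate computed from $\mathbf{x}_t$ at step $t$ (this reading is an assumption, as the summary does not define it). The code below uses the standard DDPM/DDIM formulas with a deterministic DDIM update; the function names and the toy 16-value signal are hypothetical, and plain Python lists stand in for image tensors.

```python
import math
import random

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean signal from a noisy sample and a noise prediction."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(x - sigma * e) / scale for x, e in zip(x_t, eps_pred)]

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0); returns (x_prev, x0_hat)."""
    x0_hat = predict_x0(x_t, eps_pred, alpha_bar_t)
    scale = math.sqrt(alpha_bar_prev)
    sigma = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = [scale * x0 + sigma * e for x0, e in zip(x0_hat, eps_pred)]
    return x_prev, x0_hat

# Toy check: if the noise prediction equals the true noise,
# the estimate recovers the clean signal up to rounding error.
random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(16)]
eps = [random.gauss(0.0, 1.0) for _ in range(16)]
alpha_bar_t, alpha_bar_prev = 0.5, 0.8  # assumed schedule values
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x0, eps)]
x_prev, x0_hat = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
max_err = max(abs(a - b) for a, b in zip(x0, x0_hat))
```

In practice the noise prediction comes from a trained network and is imperfect, so the estimate deviates from the original image; the theorem concerns conditioning on this estimate instead of on $\mathbf{x}_t$ directly.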

Figures (9)

  • Figure 1: Visual results compared to CDC [2023CDC] and ILLM [2023ILLM]. The patch is cropped from daniel-robert-405.png in the CLIC professional dataset [2020CLIC]. [Zoom in for best view]
  • Figure 2: The framework of the proposed method. $\mathbf{E}$ denotes the encoder, $\mathbf{D}$ denotes the end-to-end decoder, and $\bm{\mu}_\theta$ denotes the score network. The yellow frame marks the transmitted parts. Subfigure (a) illustrates the pipeline at the encoder side, which obtains the representation $\hat{\mathbf{y}}$ and the factor set $\{\gamma^*_t\}_{t=1}^T$. Subfigure (b) shows the reconstruction process at the decoder side.
  • Figure 3: Performance on diverse metrics on the CLIC professional dataset. [Zoom in for best view]
  • Figure 4: Visual results compared to CDC [2023CDC] and ILLM [2023ILLM]. [Zoom in for best view]
  • Figure 5: Performance on diverse metrics on the DIV2K test dataset. [Zoom in for best view]
  • ...and 4 more figures
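
The Figure 2 caption mentions a per-step factor set $\{\gamma^*_t\}_{t=1}^T$ obtained at the encoder, where the original image is visible. The summary does not state how these factors are computed, so the following is only a hypothetical illustration of the general idea: a single scalar chosen by least squares to blend a diffusion estimate with an end-to-end decoder output. The function names and the blending rule are assumptions, not the paper's formulation.

```python
def best_blend_factor(x0, x0_hat, x_dec):
    """Scalar gamma minimising ||x0 - (x0_hat + gamma * (x_dec - x0_hat))||^2.

    x0     : original signal (visible only at the encoder)
    x0_hat : clean-signal estimate from the diffusion model (hypothetical input)
    x_dec  : output of an end-to-end decoder (hypothetical input)
    """
    diff = [d - h for d, h in zip(x_dec, x0_hat)]
    denom = sum(v * v for v in diff)
    if denom == 0.0:
        return 0.0  # both estimates coincide, so no correction is possible
    num = sum((a - h) * v for a, h, v in zip(x0, x0_hat, diff))
    return num / denom

def blend(x0_hat, x_dec, gamma):
    """Apply the scalar correction; only gamma would need to be transmitted."""
    return [h + gamma * (d - h) for h, d in zip(x0_hat, x_dec)]

def sq_err(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Toy example with hand-picked values.
x0 = [1.0, 2.0, 3.0]
x0_hat = [0.0, 0.0, 0.0]
x_dec = [2.0, 4.0, 6.0]
gamma = best_blend_factor(x0, x0_hat, x_dec)
corrected = blend(x0_hat, x_dec, gamma)
```

Because the factor is a single number per time step, sending it costs only a few bits, which is consistent with the bitrate claim in the TL;DR; the actual derivation of $\gamma^*_t$ is given in the paper itself.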

Theorems & Definitions (5)

  • Theorem 3.1
  • Theorem 3.2
  • Lemma 1.1
  • Theorem 1.2
  • Theorem 1.3