Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder
Yiyang Ma, Wenhan Yang, Jiaying Liu
TL;DR
CorrDiff addresses the tension between distortion fidelity and perceptual quality in diffusion-based image compression by using a privileged end-to-end decoder to correct diffusion sampling with information extracted at the encoder. The method derives a correction to the diffusion score function that is conditioned on the original image, which is visible at the encoder, and approximates this correction with an external end-to-end decoder, transmitting only a few bits per time step. With a two-phase training regime and DDIM-based inference, CorrDiff improves on both distortion and perceptual metrics on standard datasets, outperforming prior diffusion-based and perceptual compression methods. The approach offers a bitrate-efficient route to reconstructions that are both faithful to the source and perceptually pleasing.
Abstract
Images produced by diffusion models can attain excellent perceptual quality. However, it is difficult for diffusion models to guarantee low distortion, so the integration of diffusion models with image compression models remains underexplored. This paper presents a diffusion-based image compression method that uses a privileged end-to-end decoder as a correction, achieving better perceptual quality while keeping distortion bounded to an extent. We build a diffusion model and design a novel paradigm that combines it with an end-to-end decoder, where the decoder is responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of diffusion models at the encoder side, where the original images are visible. Based on this analysis, we introduce an end-to-end convolutional decoder that provides a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side, and we transmit the combination effectively. Experiments demonstrate that our method outperforms previous perceptual compression methods in both distortion and perception.
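The idea of correcting the score with encoder-side knowledge can be illustrated with a small numerical sketch. It is not the paper's implementation: it assumes a standard DDPM forward process $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon$, uses a noisy stand-in (`eps_model`) in place of a trained noise predictor, and applies the ideal correction, which is computable only where $\mathbf{x}_0$ is visible. In the method described above, that correction would instead be approximated by the end-to-end decoder from a few transmitted bits. All function and variable names here are hypothetical.

```python
import numpy as np

def true_noise(x_t, x0, alpha_bar_t):
    # With x0 visible, the conditional score of q(x_t | x0) is -eps / sqrt(1 - alpha_bar_t);
    # this recovers the exact eps from x_t and x0.
    return (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    # Deterministic DDIM update (eta = 0) driven by a noise estimate eps_hat.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy "image"
eps = rng.standard_normal(x0.shape)
alpha_bar_t, alpha_bar_prev = 0.5, 1.0         # a single step straight to t = 0
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Assumption: an imperfect noise predictor, simulated as the true noise plus an error.
eps_model = eps + 0.3 * rng.standard_normal(x0.shape)

# Ideal correction term, available only at the encoder where x0 is known.
correction = true_noise(x_t, x0, alpha_bar_t) - eps_model

x_prev_uncorrected = ddim_step(x_t, eps_model, alpha_bar_t, alpha_bar_prev)
x_prev_corrected = ddim_step(x_t, eps_model + correction, alpha_bar_t, alpha_bar_prev)

err_uncorrected = np.abs(x_prev_uncorrected - x0).max()
err_corrected = np.abs(x_prev_corrected - x0).max()
print(err_uncorrected, err_corrected)
```

In this toy setting the corrected step recovers $\mathbf{x}_0$ up to floating-point error, while the uncorrected step inherits the predictor's error. A practical system cannot send the full correction, so the quality of its low-bitrate approximation determines how much of this gap is closed.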
