Generative Preprocessing for Image Compression with Pre-trained Diffusion Models

Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang

TL;DR

This work reframes image preprocessing for compression as a rate-perception problem by leveraging large pre-trained diffusion models. It distills Stable Diffusion 2.1 into a compact one-step generator via Consistent Score Identity Distillation (CiD), then fine-tunes only the attention modules, using a differentiable BPG surrogate and a rate-perception loss to guide optimization. The approach achieves substantial BD-rate reductions (up to 30.13% in DISTS on Kodak) and superior perceptual quality across standard codecs, while remaining compatible with existing pipelines. This demonstrates the potential of generative priors to enhance perceptual compression preprocessing and informs future rate-perception optimization strategies.

Abstract

Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model's attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model's powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
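
To make the Rate-Perception objective more concrete, the following is a minimal sketch of a composite loss of the kind Figure 1 depicts: an L1 term, a perceptual term, and a bitrate term whose weight depends on the codec QP. The source does not give the actual formula, so everything specific here is an assumption for illustration: the linear schedule, its direction (heavier rate penalty at higher QP), the QP range, the weight values, and the names `rate_weight` and `rate_perception_loss`. The inputs are plain scalars standing in for values that a real training loop would obtain from the preprocessed image and the differentiable codec surrogate.

```python
def rate_weight(qp, qp_min=22, qp_max=42, w_low=0.1, w_high=1.0):
    """Hypothetical QP-based schedule: the bitrate weight grows linearly with QP.

    All defaults are illustrative assumptions, not values from the paper.
    """
    qp_clamped = min(max(qp, qp_min), qp_max)
    t = (qp_clamped - qp_min) / (qp_max - qp_min)
    return w_low + t * (w_high - w_low)


def rate_perception_loss(l1, perceptual, bpp, qp, w_l1=1.0, w_perc=1.0):
    """Weighted sum of fidelity, perceptual, and bitrate terms (assumed form).

    l1         -- L1 distance between the reference and the decoded image
    perceptual -- perceptual distance (for example LPIPS or DISTS)
    bpp        -- bitrate estimate from the differentiable codec surrogate
    qp         -- quantization parameter used for this training sample
    """
    return w_l1 * l1 + w_perc * perceptual + rate_weight(qp) * bpp


if __name__ == "__main__":
    # Toy numbers only; they are not measurements from the paper.
    for qp in (22, 32, 42):
        print(qp, rate_perception_loss(l1=0.05, perceptual=0.20, bpp=0.50, qp=qp))
```

In this sketch the schedule changes only the bitrate weight, so the same fidelity and perceptual terms apply at every QP. Other designs, such as rescaling the perceptual term instead, would be equally consistent with the description in the source.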

Paper Structure

This paper contains 5 sections, 7 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of our proposed two-stage framework. (Top) Distillation Stage: We distill a pre-trained, multi-step Stable Diffusion 2.1 model into a compact, one-step U-Net using Consistent Score Identity Distillation (CiD). The VAE is kept frozen, and the text conditioning is replaced by a fixed embedding. (Bottom) Rate-Perception Fine-tuning Stage: The distilled one-step generator is fine-tuned for the preprocessing task. An input image is processed by the fine-tuned U-Net (with frozen VAE) and then passed through a differentiable BPG (Diff-BPG) surrogate. A composite loss, balancing L1, perceptual, and bitrate terms with a dynamic QP-based schedule, guides the optimization of the U-Net's attention modules.
  • Figure 2: Details of the distillation stage.
  • Figure 3: Rate-Perception (R-P) curves on the CLIC validation dataset and Kodak dataset. These plots compare the perceived quality (LPIPS, DISTS, TOPIQ-fr) against the bitrate (bpp) for different codecs (JPEG, WebP, BPG) and preprocessing methods (TDP, Ours). Our method consistently achieves better perceptual quality at lower bitrates across all metrics and datasets.
  • Figure 4: Qualitative comparison on a sample from the CLIC validation dataset. Our method effectively removes compression artifacts while preserving and enhancing fine-grained textures, leading to superior visual quality compared to the anchor WebP and TDP.
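
The BD-rate figures quoted for the R-P curves summarize the average bitrate change between two curves at equal quality. The sketch below shows one way such a number can be computed from (bpp, quality) points. It is a simplification and not the authors' evaluation code: it uses piecewise-linear interpolation of log-bitrate over the shared quality range instead of the cubic fit used in the conventional Bjøntegaard calculation, and the function names and sample points are made up for illustration.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the interpolation range")


def bd_rate(anchor, test, samples=100):
    """Average bitrate change of `test` relative to `anchor`, in percent.

    Each curve is a list of (bpp, quality) points with distinct quality
    values. Negative results mean the test curve needs fewer bits for the
    same quality. Simplified variant: linear interpolation, not a cubic fit.
    """
    def prepare(points):
        pts = sorted(points, key=lambda p: p[1])  # sort by quality score
        return [q for _, q in pts], [math.log(r) for r, _ in pts]

    q_a, log_r_a = prepare(anchor)
    q_t, log_r_t = prepare(test)

    # Only the quality interval covered by both curves can be compared.
    lo, hi = max(q_a[0], q_t[0]), min(q_a[-1], q_t[-1])
    if lo >= hi:
        raise ValueError("curves do not overlap in quality")

    step = (hi - lo) / samples
    grid = [min(hi, lo + i * step) for i in range(samples + 1)]
    diffs = [_interp(q, q_t, log_r_t) - _interp(q, q_a, log_r_a) for q in grid]

    # Trapezoidal mean of the log-rate difference over the shared interval.
    mean_diff = (sum(diffs) - 0.5 * (diffs[0] + diffs[-1])) / samples
    return (math.exp(mean_diff) - 1.0) * 100.0


if __name__ == "__main__":
    # Synthetic points (bpp, DISTS); lower DISTS means better quality.
    anchor = [(0.2, 0.30), (0.4, 0.20), (0.8, 0.10)]
    test = [(bpp * 0.7, dists) for bpp, dists in anchor]
    print(f"BD-rate: {bd_rate(anchor, test):.2f}%")
```

Because the integration runs over the quality axis, the same routine applies to lower-is-better metrics such as LPIPS and DISTS and to higher-is-better metrics, as long as both curves are scored with the same metric.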