CADC: Content Adaptive Diffusion-Based Generative Image Compression

Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang

TL;DR

A content-adaptive diffusion-based image codec with three technical innovations: an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.

Abstract

Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck -- arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input -- prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.
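
To make the contrast between isotropic and uncertainty-guided quantization concrete, the following is a minimal NumPy sketch. The abstract does not give the quantization rule, so the mapping from uncertainty to step size used here (`step = base_step * (1 + gain * m)`), the function names, and the parameters `base_step` and `gain` are all illustrative assumptions, not the authors' formulation. In the codec, the map $\mathbf{m}$ is learned; here it is filled in by hand.

```python
import numpy as np


def isotropic_quantize(y, step=1.0):
    """Uniform quantization: the same step size for every latent element."""
    return np.round(y / step) * step


def adaptive_quantize(y, m, base_step=1.0, gain=1.0):
    """Spatially varying quantization (hypothetical rule, not from the paper).

    y: latent of shape (C, H, W)
    m: uncertainty map of shape (H, W) with values in [0, 1]
    Assumption: the step grows linearly with the uncertainty value.
    """
    step = base_step * (1.0 + gain * m)  # shape (H, W), broadcast over channels
    return np.round(y / step) * step, step


rng = np.random.default_rng(0)
y = rng.normal(size=(4, 8, 8))

# Hand-made uncertainty map: left half certain (0), right half uncertain (1).
m = np.zeros((8, 8))
m[:, 4:] = 1.0

y_hat, step = adaptive_quantize(y, m)
err = np.abs(y_hat - y)  # per-element quantization error
```

Under this assumed rule, the quantization error at each position is bounded by half the local step, so the distortion budget follows the uncertainty map rather than being fixed across the image.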

Paper Structure (26 sections, 16 equations, 14 figures, 3 tables)

This paper contains 26 sections, 16 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: A qualitative comparison between our codec, StableCodec zhang2025stablecodec, and DLF xue2025dlf when compressing a 2K-resolution image of the test set of CLIC 2020 Professional toderici2020clic under ultra-low bitrate conditions. Our codec produces images with high visual quality, especially in regions with complex texture. In contrast, DLF and StableCodec exhibit noticeable artifacts, such as blurring and color shifting.
  • Figure 2: On the encoder side, an analysis transform $g_a$ encodes the input image $\mathbf{x}$ into a compact latent representation $\mathbf{y}$. An uncertainty map $\mathbf{m}$ is estimated by $f_u$ to guide the quantization of $\mathbf{y}$. The quantized latent $\mathbf{\hat{y}}$ is encoded into a bitstream via an arithmetic encoder (AE) and transmitted. On the decoder side, a synthesis transform $g_s$ upsamples $\mathbf{\hat{y}}$ to produce a noisy latent $\bm{l}_T$ at the spatial resolution required by the pre-trained Stable Diffusion VAE decoder $\mathcal{D}_{SD}$ rombach2022high. In learned codecs, $\bm{l}_T$ typically has a high channel count (e.g., 320), while $\mathcal{D}_{SD}$ is fixed to accept only 4-channel input. To resolve this, the entire $\bm{l}_T$ is commonly fed to the U-Net $\epsilon_{SD}$ (the first convolutional layer of the U-Net is given a new input channel count) so that all available context is used to estimate more accurate 4-channel noise (the output channel count of the U-Net is still 4) zhang2025stablecodec. The denoising process is applied exclusively to the first four noisy channels $\bm{l}_T^{(1:4)}$, yielding the standard 4-channel clean latent $\bm{l}_0$ for $\mathcal{D}_{SD}$. To concentrate essential semantic information into $\bm{l}_T^{(1:4)}$, a lightweight auxiliary decoder $g_{aux}$ takes $\bm{l}_T^{(1:4)}$ as input to reconstruct an auxiliary image $\mathbf{\hat{x}}_{aux}$. To produce a content-adaptive textual description $c_{aux}$, $\mathbf{\hat{x}}_{aux}$ is captioned by $f_c$ (a frozen BLIP li2022blip). $c_{aux}$ is then combined with a fixed description $c_{fix}$ to condition a one-step diffusion denoising process sauer2024adversarial.
  • Figure 3: Quantitative comparisons of different generative image codecs on Kodak, DIV2K Val, and CLIC 2020 Test.
  • Figure 4: Qualitative comparison of different generative image codecs on the Kodak dataset under ultra-low bitrate conditions.
  • Figure 5: Analysis of Uncertainty-Guided Adaptive Quantization (UGAQ) and the isotropic quantization on the DIV2K dataset.
  • ...and 9 more figures
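
The channel handling described in the Figure 2 caption (a 320-channel noisy latent goes into the U-Net, a 4-channel noise estimate comes out, and only the first four channels are denoised) can be illustrated with a small NumPy sketch. Everything beyond those shapes is an assumption made for illustration: `toy_unet` is a 1x1 channel-mixing stand-in for $\epsilon_{SD}$, `alpha_bar` is a hypothetical noise-schedule value, and the update is the standard DDPM clean-sample estimate, which the caption does not confirm as the rule used for the one-step denoising.

```python
import numpy as np


def toy_unet(l_T, weight):
    """Stand-in for epsilon_SD: mixes all C input channels into 4 noise channels.

    l_T: noisy latent of shape (C, H, W); weight: shape (4, C).
    A real U-Net is far more complex; only the channel counts match the caption.
    """
    return np.einsum("oc,chw->ohw", weight, l_T)


def one_step_denoise(l_T, eps_hat, alpha_bar):
    """Denoise only the first four channels of l_T in a single step.

    Assumption: standard DDPM clean-sample estimate,
    l_0 = (l_T[:4] - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar).
    """
    primary = l_T[:4]  # the channels that will reach the 4-channel VAE decoder
    return (primary - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)


rng = np.random.default_rng(0)
C, H, W = 320, 8, 8
l_T = rng.normal(size=(C, H, W))             # full noisy latent from g_s
weight = rng.normal(size=(4, C)) / np.sqrt(C)

eps_hat = toy_unet(l_T, weight)              # uses all 320 channels as context
l_0 = one_step_denoise(l_T, eps_hat, alpha_bar=0.5)  # 4-channel clean latent
```

The remaining 316 channels influence the result only through the noise estimate, which is why the caption stresses concentrating essential information in $\bm{l}_T^{(1:4)}$.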