CoD: A Diffusion Foundation Model for Image Compression

Zhaoyang Jia, Zihan Zheng, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Houqiang Li, Yan Lu

TL;DR

CoD is a compression-oriented diffusion foundation model, trained from scratch to jointly optimize encoding and diffusion-based reconstruction, that serves as a reusable backbone for downstream diffusion codecs. At ultra-low bitrates it outperforms text-conditioned diffusion backbones, and in pixel space, paired with DiffC, it approaches VTM-level PSNR, all at low training cost on open image datasets. The work also offers insights into scaling behavior, pixel-space versus latent-space diffusion, and zero-shot distortion-perception control, with results reported on standard benchmarks such as Kodak. Overall, CoD lays a foundation for diffusion-based compression research and for practical, reproducible exploration on accessible data and hardware.

Abstract

Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce \textbf{CoD}, the first \textbf{Co}mpression-oriented \textbf{D}iffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: \textbf{High compression efficiency}: replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); \textbf{Low-cost and reproducible training}: 300$\times$ faster training than Stable Diffusion ($\sim$20 vs. $\sim$6,250 A100 GPU days) on entirely open image-only datasets; \textbf{New insights}: e.g., we find that pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code will be released.
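To put the abstract's headline numbers in concrete terms, the short sketch below converts a bitrate in bits per pixel (bpp) into a payload size and recomputes the training-cost ratio. The 512×512 resolution is taken from the Figure 5 caption; the function name `payload_bits` is purely illustrative and not part of any CoD code.

```python
def payload_bits(bpp: float, width: int, height: int) -> float:
    """Total number of bits spent on an image at a given bits-per-pixel rate."""
    return bpp * width * height


# 0.0039 bpp is the ultra-low bitrate quoted in the abstract;
# 512x512 is the Kodak evaluation resolution from the Figure 5 caption.
bits = payload_bits(0.0039, 512, 512)
size_bytes = bits / 8

# Approximate A100 GPU days quoted in the abstract (both are rough figures).
cod_gpu_days = 20
sd_gpu_days = 6250
speedup = sd_gpu_days / cod_gpu_days

print(f"{bits:.0f} bits, about {size_bytes:.0f} bytes per image")
print(f"training cost ratio: {speedup:.1f}x")
```

At this rate a whole 512×512 image is described by roughly 128 bytes, which is why the choice of conditioning signal matters so much. The cost ratio works out to about 312, consistent with the rounded "300$\times$" in the abstract.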

Paper Structure

This paper contains 33 sections, 11 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Overview of Compression-oriented Diffusion (CoD) foundation models, which are trained from scratch to jointly optimize compression and generation. Rather than a fixed codec, CoD serves as a foundation model for downstream diffusion-based codecs such as DiffC \cite{diffc_original}, substantially enhancing their performance when it replaces Stable Diffusion.
  • Figure 2: Framework overview of CoD in pixel and latent spaces. CoD consists of a condition encoder, an entropy bottleneck, a condition decoder, and a diffusion model that is decoupled into a DiT backbone and a DDT head \cite{ddt}. CoD is trained with rectified flow \cite{rf}, where $1-\alpha\%$ of the samples are trained at $t=0$ to jointly optimize distortion and perception.
  • Figure 3: Effects of unified training with rectified flow.
  • Figure 4: Scaling law analysis on the Kodak dataset at a resolution of $256\times256$. All CoD models are at 0.016 bpp, while MS-ILLM is at 0.021 bpp. FID is computed over overlapping $64\times64$ patches following \cite{ddcm}.
  • Figure 5: Comparison of CoD and Stable Diffusion on Kodak at $512\times512$ resolution. (left) Pixel-space CoD enables zero-shot distortion-perception control by adjusting the number of sampling steps. CoD is at 0.0039 bpp and PerCo is at 0.0036 bpp. (right) Text conditioning harms the performance of the zero-shot algorithm DiffC on Stable Diffusion, while CoD conditioning improves LPIPS at low bitrates. In addition, pixel-space CoD is not limited by the SD-VAE, and thus covers a wider bitrate range and shows higher PSNR and greater potential in perceptual quality.
  • ...and 13 more figures
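
The Figure 4 caption mentions FID computed over overlapping $64\times64$ patches of $256\times256$ images. As a rough illustration of what such a patch grid looks like, the sketch below enumerates the top-left corners of the patches. The stride of 32 pixels is an assumption made for this example; the caption does not state the stride, and the function name `patch_origins` is hypothetical.

```python
def patch_origins(height: int, width: int, patch: int = 64, stride: int = 32):
    """Top-left (y, x) corners of all patches that fit fully inside the image.

    stride < patch makes neighbouring patches overlap; stride=32 is an
    assumed value for illustration, not one taken from the paper.
    """
    ys = range(0, height - patch + 1, stride)
    xs = range(0, width - patch + 1, stride)
    return [(y, x) for y in ys for x in xs]


origins = patch_origins(256, 256)
print(len(origins), "patches per image")
print("first:", origins[0], "last:", origins[-1])
```

Each corner `(y, x)` identifies the crop `image[y:y+64, x:x+64]`. Under the assumed stride, one $256\times256$ image yields a 7×7 grid of 49 patches, which gives a patch-level FID many more samples than the small Kodak set would provide at full-image granularity.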