StableCodec: Taming One-Step Diffusion for Extreme Image Compression

Tianyu Zhang, Xin Luo, Li Li, Dong Liu

TL;DR

StableCodec tackles extreme image compression by fusing one-step diffusion with a Deep Compression Latent Codec and a Dual-Branch Coding Structure, enabling high realism and fidelity at ultra-low bitrates. It introduces end-to-end optimization with two-stage implicit bitrate pruning, leveraging perceptual and semantic losses to guide reconstruction under stringent bitrate constraints. Empirically, it achieves state-of-the-art FID, KID, and DISTS on CLIC 2020, DIV2K, and Kodak at bitrates as low as 0.005 bpp, while maintaining competitive inference speed and memory footprint. This approach broadens the practicality of diffusion-based codecs for real-time applications and extreme compression scenarios.

Abstract

Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of an auxiliary encoder and decoder pair, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak datasets demonstrate that StableCodec outperforms existing methods in terms of FID, KID, and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable to mainstream transform coding schemes. All source code is available at https://github.com/LuizScarlet/StableCodec.
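
To make the bitrates quoted in the abstract concrete, the following minimal sketch converts a bits-per-pixel target into a total bit budget for one image. The function names are hypothetical and not part of the StableCodec code base, and the image sizes are illustrative assumptions: 768x512 is the Kodak image size, and 3840x2160 is one common meaning of "4K".

```python
def bit_budget(width: int, height: int, bpp: float) -> float:
    """Total bits available for a width x height image at `bpp` bits per pixel."""
    return width * height * bpp


def bits_to_bytes(bits: float) -> float:
    """Convert a bit count to bytes (8 bits per byte)."""
    return bits / 8.0


if __name__ == "__main__":
    # Illustrative sizes (assumptions of this sketch, not values from the paper).
    for width, height in [(768, 512), (3840, 2160)]:
        # 0.05 bpp is the "ultra-low" threshold; 0.005 bpp is the lowest rate cited.
        for bpp in (0.05, 0.005):
            bits = bit_budget(width, height, bpp)
            print(
                f"{width}x{height} @ {bpp} bpp: "
                f"{bits:.0f} bits (~{bits_to_bytes(bits):.0f} bytes)"
            )
```

At 0.005 bpp, a 768x512 image has a budget of about 1,966 bits, or roughly 246 bytes, for the entire bitstream.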

Paper Structure

This paper contains 25 sections, 7 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Visual examples and comparisons when compressing a 4K-resolution image [li2024ustc] at ultra-low bitrates. The proposed StableCodec produces more realistic and consistent details with fewer bits. In contrast, VVC [bross2021overview], ELIC [he2022elic], and MS-ILLM [muckley2023improving] reconstructions are blurry, while PerCo [careil2023towards] and DiffEIC [li2024towards] generate details that are inconsistent with the original images. Best viewed on screen for details.
  • Figure 2: (Top) Illustration of our motivation. One-step diffusion can produce perceptually consistent results given severely corrupted images and a general prompt. (Bottom) Challenges in StableCodec: how to compress a noisy latent for one-step diffusion at ultra-low bitrates, and how to improve fidelity.
  • Figure 3: The framework of StableCodec. We incorporate the proposed Deep Compression Latent Codec to transmit a noisy latent $l_{T}$ for one-step denoising, where $64\times$ denotes a spatial compression ratio of 64. To adjust the latent resolution, we deploy a DownSample block and a Conv3$\times$3 layer as adapters after the VAE encoder $\mathcal{E}_{\mathrm{SD}}$ and the auxiliary encoder $\mathcal{E}_{\mathrm{Aux}}$, respectively. We use a general prompt in both training and inference. The auxiliary decoder $\mathcal{D}_{\mathrm{Aux}}$ shares a similar structure with $g_{s}$. More details on the networks are provided in the supplementary material.
  • Figure 4: Top-energy channels learned from different encoders. $\mathcal{E}_{\mathrm{Aux}}$ embeds more pixel-level semantic information into the codec.
  • Figure 5: Impact of $\mathcal{D}_{\mathrm{Aux}}$ on latents and reconstructions.
  • ...and 10 more figures
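
The Figure 3 caption mentions a spatial compression ratio of 64 but does not say whether that factor applies per side or to the total area. The sketch below is only an arithmetic illustration covering both readings: it computes the latent grid size for a given per-side factor and the average number of bits available per latent position at a target bitrate. All function names, the padding behavior, and the 768x512 example size are assumptions of this sketch, not details taken from the paper.

```python
import math


def latent_grid(width: int, height: int, per_side_factor: int) -> tuple[int, int]:
    """Latent grid (columns, rows) after downsampling each side by `per_side_factor`.

    Rounding up assumes the image is padded to a multiple of the factor,
    which is an assumption of this sketch.
    """
    return (
        math.ceil(width / per_side_factor),
        math.ceil(height / per_side_factor),
    )


def bits_per_position(width: int, height: int, bpp: float, per_side_factor: int) -> float:
    """Average bit budget per latent position at `bpp` bits per pixel."""
    cols, rows = latent_grid(width, height, per_side_factor)
    return width * height * bpp / (cols * rows)


if __name__ == "__main__":
    width, height, bpp = 768, 512, 0.005  # illustrative values
    # Reading A: 64x per side. Reading B: 64x in area, i.e. 8x per side.
    for label, factor in [("64x per side", 64), ("64x in area", 8)]:
        cols, rows = latent_grid(width, height, factor)
        avg_bits = bits_per_position(width, height, bpp, factor)
        print(f"{label}: {cols}x{rows} grid, {avg_bits:.2f} bits per position")
```

Under the per-side reading, a 768x512 image maps to a 12x8 grid, so at 0.005 bpp each latent position carries about 20 bits on average.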