
Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures

Chuqin Zhou, Xiaoyue Ling, Yunuo Chen, Jincheng Dai, Guo Lu, Wenjun Zhang

TL;DR

This work addresses ultra-low bitrate image compression by unifying explicit semantic information with implicit diffusion-based textures in a training-free framework. By encoding compact explicit signals ($\hat{y}, c$) and transmitting texture details through reverse-channel coding conditioned on those signals, the method achieves strong perceptual quality while preserving semantic fidelity. A plug-in distortion-perception module and tile-based processing provide fine-grained control and scalability to high-resolution inputs, enabling flexible rate–distortion–perception tradeoffs. Empirical results demonstrate state-of-the-art performance across Kodak, DIV2K, and CLIC2020, with substantial gains in DISTS, CLIPSim, and related metrics over prior diffusion- and explicit-prior methods, and strong compatibility with multiple base codecs.

Abstract

While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
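The DISTS BD-Rate figures quoted above summarize the average bitrate change at equal DISTS between two rate–metric curves. The following is a minimal sketch of how such a number can be computed. It is not the authors' evaluation code: it swaps the cubic fit of the classic Bjøntegaard method for piecewise-linear interpolation of log-rate, and the function names (`bd_rate`, `_avg_log_rate`) and all sample points are hypothetical.

```python
import math


def _avg_log_rate(points, lo, hi):
    """Mean of log(rate) over metric values in [lo, hi].

    `points` is a list of (rate, metric) pairs with distinct metric values.
    log(rate) is interpolated piecewise-linearly between the points
    (an assumption; the classic method fits a cubic polynomial instead).
    """
    pts = sorted((m, math.log(r)) for r, m in points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def interp(x):
        for i in range(len(xs) - 1):
            if xs[i] <= x <= xs[i + 1]:
                t = (x - xs[i]) / (xs[i + 1] - xs[i])
                return ys[i] + t * (ys[i + 1] - ys[i])
        raise ValueError("metric value outside the curve's range")

    # Integrate exactly with the trapezoid rule between interpolation knots.
    knots = [lo] + [x for x in xs if lo < x < hi] + [hi]
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        area += 0.5 * (interp(a) + interp(b)) * (b - a)
    return area / (hi - lo)


def bd_rate(anchor, test):
    """Average rate change (in percent) of `test` relative to `anchor`
    at equal metric value. Negative means `test` needs fewer bits."""
    a_m = [m for _, m in anchor]
    t_m = [m for _, m in test]
    lo = max(min(a_m), min(t_m))
    hi = min(max(a_m), max(t_m))
    if hi <= lo:
        raise ValueError("curves do not overlap in metric range")
    diff = _avg_log_rate(test, lo, hi) - _avg_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical (bpp, DISTS) points, for illustration only.
anchor = [(0.01, 0.30), (0.02, 0.25), (0.04, 0.20), (0.08, 0.15)]
test = [(0.008, 0.30), (0.016, 0.25), (0.032, 0.20), (0.064, 0.15)]
print(f"BD-Rate: {bd_rate(anchor, test):.2f}%")
```

Because the comparison is made at equal metric value over the overlapping metric range, the same routine applies whether the metric is lower-is-better (DISTS) or higher-is-better. In the hypothetical data the test curve uses 20% fewer bits at every DISTS level, so the result is about -20%.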

Paper Structure (30 sections, 2 equations, 14 figures, 5 tables, 4 algorithms)

This paper contains 30 sections, 2 equations, 14 figures, 5 tables, 4 algorithms.

Figures (14)

  • Figure 1: Comparison of different methods at ultra-low bitrates. Blue denotes explicit compression methods, green denotes implicit ones, and red represents our dual-representation approach. Our method better preserves textural and semantic details.
  • Figure 2: Visual examples and comparisons on a 2K-resolution image at ultra-low bitrates. Our method reconstructs more realistic and consistent details with fewer bits. In contrast, PerCo [Careil_2024_ICLR_Perco], DiffEIC [Li_25_TCSVT_DiffEIC], and DiffC [Vonderfecht_25_ICLR_Diffc] exhibit inconsistent details compared to the original images. Best viewed on screen for details.
  • Figure 3: Overview of the proposed dual-branch compression framework. Explicit semantics consist of the quantized latent $\hat{y}$ and a tag-style text prompt $c$, while implicit textures are derived from noise-corrupted latents using reverse-channel coding (RCC).
  • Figure 4: Rate–metric curves on Kodak, DIV2K, and CLIC2020 datasets. Arrows indicate whether higher ($\uparrow$) or lower ($\downarrow$) values are better. See supplementary for more results.
  • Figure 5: Rate–metric curves on the Kodak dataset. Our method is applied to multiple versions of each base model, trained at different bitrates, resulting in several curves per model. Our method consistently outperforms the corresponding baselines across all bitrate settings.
  • ...and 9 more figures