PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling

Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller

TL;DR

PerCoV2 addresses ultra-low bitrate perceptual image compression by building on Stable Diffusion 3 with an explicit discrete entropy model for hyper-latents via an implicit hierarchical masked image model. It compares recent entropy-modeling approaches (VAR and MaskGIT) and demonstrates improved fidelity at bitrates as low as $0.003$ to $0.03$ bpp on MSCOCO-30k and Kodak, while offering a hybrid generation mode for extra savings. The approach remains open-source, with open SD3 backbones and a two-stage training pipeline that jointly optimizes compression and generation. This work advances practical, publicly available perceptual compression, enabling efficient storage and bandwidth use without sacrificing perceptual realism, particularly at ultra-low bitrates.
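To give a sense of scale for the bitrates quoted above, the following minimal sketch converts bits per pixel (bpp) into a payload size. It assumes a $512\times512$ image (the Kodak evaluation resolution mentioned in the Figure 1 caption) and a 24-bit RGB baseline for the raw image. The function names `payload_bytes` and `compression_ratio` are illustrative and not part of the PerCoV2 code base.

```python
def payload_bytes(bpp: float, width: int, height: int) -> float:
    """Return the coded size in bytes of a width x height image at `bpp` bits per pixel."""
    return bpp * width * height / 8


def compression_ratio(bpp: float, raw_bpp: float = 24.0) -> float:
    """Return the ratio of raw to coded size, assuming 24-bit RGB input by default."""
    return raw_bpp / bpp


if __name__ == "__main__":
    # Endpoints of the bitrate range quoted in the TL;DR.
    for bpp in (0.003, 0.03):
        size = payload_bytes(bpp, 512, 512)
        ratio = compression_ratio(bpp)
        print(f"{bpp} bpp -> {size:.1f} bytes, about {ratio:.0f}x smaller than raw RGB")
```

Under these assumptions, the low end of the range corresponds to roughly 98 bytes per $512\times512$ image and the high end to roughly 983 bytes, compared with 786,432 bytes for the uncompressed 24-bit image.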

Abstract

We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at https://github.com/Nikolai10/PerCoV2.

Paper Structure

This paper contains 20 sections, 13 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Distortion-perception comparison on the Kodak dataset at $512\times512$ resolution (top left is best). We show different operating modes for PerCo and PerCoV2 by varying the number of sampling steps/classifier-free guidance; see \ref{subsec:dp_tradeoff}.
  • Figure 2: Visual comparison of PerCoV2 on the Kodak dataset at our lowest bit-rate configuration. Bit-rate increases relative to our method are indicated by $(\times)$. For comparisons at higher bit-rates, see \ref{fig:vis_impressions_2}. Best viewed electronically.
  • Figure 3: PerCoV2 model overview based on our lowest bit-rate configuration. Colors follow \cite{esser2024scaling}.
  • Figure 4: Quantitative comparison of PerCoV2 on MSCOCO-30k.
  • Figure 5: Visual comparison of PerCoV2 on the Kodak dataset at an extreme bit-rate configuration. Bit-rate increases relative to our method are indicated by $(\times)$. Best viewed electronically.
  • ...and 13 more figures