Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis

Atefeh Khoshkhahtinat, Ali Zafari, Piyush M. Mehta, Nasser M. Nasrabadi

TL;DR

This work addresses perceptual quality and compression efficiency in neural image coding by introducing a non-isotropic, blur-dissipated diffusion decoder that differentiates frequency components and a Transformer-based entropy model with Laplacian-shaped positional encoding to capture rich spatio-channel dependencies. The diffusion decoder is conditioned on a semantic latent and texture latents, enabling coarse-to-fine synthesis, while the entropy model uses uneven channel grouping and a parallel, bidirectional context with local checkerboard and global Transformer blocks. Key contributions include the per-frequency diffusion schedules, the Laplacian-based receptive-field-aware attention, and extensive ablations demonstrating bitrate savings and improved perceptual quality over state-of-the-art generative codecs. The framework achieves superior rate-perception tradeoffs on standard datasets, with practical benefits in visual fidelity and decoding speed through parallelizable context modeling.
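
To make the idea of per-frequency schedules concrete, here is a minimal sketch of the deterministic blurring part of such a forward process on a 1-D signal. It assumes a heat-dissipation form in which DCT coefficient `k` is damped by `exp(-lambda_k * tau)` with `lambda_k = (pi * k / n)^2`; that form, the function names, and the parameter `tau` are illustrative assumptions rather than the authors' implementation, and the noise term is omitted.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(u):
    """Inverse of dct() above (transpose of the orthonormal DCT-II matrix)."""
    n = len(u)
    out = []
    for i in range(n):
        s = u[0] * math.sqrt(1.0 / n)
        s += sum(
            u[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def frequency_damping(n, tau):
    """Assumed per-frequency attenuation: exp(-lambda_k * tau)."""
    return [math.exp(-((math.pi * k / n) ** 2) * tau) for k in range(n)]

def blur(x, tau):
    """Damp each frequency by its own factor; larger tau means stronger blur."""
    u = dct(x)
    d = frequency_damping(len(x), tau)
    return idct([uk * dk for uk, dk in zip(u, d)])

# Usage: high frequencies vanish first, so coarse structure survives longest.
signal = [0.0, 1.0, 0.0, 1.0, 4.0, 1.0, 0.0, 1.0]
slightly_blurred = blur(signal, 0.5)
heavily_blurred = blur(signal, 8.0)
```

Because the DC coefficient has `lambda_0 = 0`, the mean of the signal is untouched while fine detail decays, which is the coarse-to-fine ordering a reverse process would then undo.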

Abstract

While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, the lack of inductive bias for image data in such models restricts their ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately models the probability distribution of the latent representation by exploiting spatio-channel correlations in latent space, while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon a Transformer specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding, the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that the proposed framework yields better perceptual quality than cutting-edge generative codecs, and that the proposed entropy model contributes to notable bitrate savings.
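
One plausible reading of a Laplacian-shaped positional encoding is an attention bias equal to the log of a Laplacian density over relative distance, so that attention falls off with distance at a rate set by a scale parameter. The sketch below follows that reading. The exact formula, the L1 distance, the names `loc` and `scale`, and the per-chunk values are all assumptions made for illustration and are not taken from the paper.

```python
import math

def laplacian_bias(window, scale, loc=0.0):
    """Log-Laplacian bias between all pairs of positions in a window x window grid.

    `scale` and `loc` stand in for the learnable parameters (hypothetical names).
    """
    coords = [(r, c) for r in range(window) for c in range(window)]
    bias = []
    for (r1, c1) in coords:
        row = []
        for (r2, c2) in coords:
            dist = abs(r1 - r2) + abs(c1 - c2)  # L1 relative distance (assumed)
            row.append(-abs(dist - loc) / scale)
        bias.append(row)
    return bias

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def attention_weights(logits, bias):
    """Add the positional bias to the attention logits, then normalize each row."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]

# Hypothetical per-chunk scales: a small scale gives a narrow receptive field,
# a large scale lets a chunk attend more broadly.
chunk_scales = {0: 0.5, 1: 2.0}
flat_logits = [[0.0] * 4 for _ in range(4)]  # content-agnostic logits, 2x2 window
weights_per_chunk = {
    j: attention_weights(flat_logits, laplacian_bias(2, s))
    for j, s in chunk_scales.items()
}
```

Giving each channel chunk its own scale is what lets the effective receptive field differ between chunks in this sketch; in a trained model those values would be learned rather than fixed.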

Paper Structure (24 sections, 32 equations, 8 figures, 2 tables, 2 algorithms)

Figures (8)

  • Figure 1: Overview of our proposed neural codec. The quantized semantic latent variable $\bm{\hat{y}}$ is utilized by a diffusion-based decoder to generate a realistic reconstructed image.
  • Figure 2: (a) Diagram illustrating the application of the proposed entropy model for decoding the $j$-th chunk $\bm{\hat{y}^{(j)}}$. (b) Global Spatial Context Block. (c) An example of a checkerboard-shaped mask.
  • Figure 3: The procedure for acquiring Laplacian relative position encoding for a window with a size of $2\times 2$.
  • Figure 4: Comparison of our method with other codecs in terms of rate/distortion [bpp $\downarrow$ / PSNR $\uparrow$] and rate-perception, including [bpp $\downarrow$ / FID $\downarrow$] and [bpp $\downarrow$ / LPIPS $\downarrow$], for both the CLIC2020 test set and the Kodak dataset.
  • Figure 5: Visual comparison of our method to HiFiC and CDC models shows that our model achieves superior reconstruction quality, particularly at lower bit-rates. In addition, our model displays fewer artifacts compared to both the HiFiC and CDC models.
  • ...and 3 more figures
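
The checkerboard-shaped mask mentioned in the Figure 2 caption can be sketched as follows. This is a generic two-pass checkerboard context scheme under assumed conventions (which parity is decoded first, and the helper names, are hypothetical), not the paper's exact configuration.

```python
def checkerboard_mask(height, width, anchor_parity=0):
    """Return a 0/1 grid where 1 marks anchor positions (decoded in the first pass)."""
    return [
        [1 if (r + c) % 2 == anchor_parity else 0 for c in range(width)]
        for r in range(height)
    ]

def decoding_passes(height, width, anchor_parity=0):
    """Split all positions into anchors (pass 1) and non-anchors (pass 2)."""
    mask = checkerboard_mask(height, width, anchor_parity)
    anchors, non_anchors = [], []
    for r in range(height):
        for c in range(width):
            (anchors if mask[r][c] else non_anchors).append((r, c))
    return anchors, non_anchors

def neighbours(r, c, height, width):
    """In-bounds 4-connected neighbours of position (r, c)."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < height and 0 <= j < width]

# Usage: every position in a pass is independent of the others in that pass,
# so each pass can be decoded in parallel.
anchors, non_anchors = decoding_passes(4, 4)
```

The property that makes this parallelizable is that every 4-connected neighbour of a non-anchor is an anchor, so the second pass can condition on already decoded local context in all directions at once.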