MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression

Wei Jiang, Jiayu Yang, Yongqi Zhai, Feng Gao, Ronggang Wang

TL;DR

This work tackles the entropy-modeling bottleneck in learned image compression by introducing MEM$^{++}$, a linear-complexity, multi-reference entropy model that jointly captures channel-wise, local spatial, and global spatial correlations. Building on MEM$^{++}$, the authors present MLIC$^{++}$, a codec that partitions the latent representation into slices and uses four context streams (channel-wise, local, intra-slice global, inter-slice global) to estimate conditional entropies efficiently. Key innovations include a shifted-window-based overlapped checkerboard attention for local context and a linear-complexity global attention, obtained by decomposing the softmax, for intra- and inter-slice contexts; together these enable high-resolution coding with memory and computation that grow linearly with resolution. Experiments on Kodak, Tecnick, and CLIC show state-of-the-art BD-rate results (e.g., a $13.39\%$ BD-rate reduction on Kodak relative to VTM-17.0 in PSNR) and competitive RD performance, with substantial memory and speed advantages over quadratic-context methods. The code, pre-trained models, and training data are public, which supports reproducibility.
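The two partitions mentioned above (channel slices and a spatial checkerboard) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names `slice_channels` and `checkerboard`, the equal-sized contiguous slices, and the choice of even-parity positions as anchors are not taken from the paper or its code.

```python
def slice_channels(num_channels, num_slices):
    """Split channel indices into equal contiguous slices (assumed layout)."""
    if num_channels % num_slices != 0:
        raise ValueError("num_channels must be divisible by num_slices")
    size = num_channels // num_slices
    return [list(range(s * size, (s + 1) * size)) for s in range(num_slices)]


def checkerboard(height, width):
    """Return (anchor, non_anchor) position lists.

    Which parity counts as "anchor" is an assumption of this sketch.
    """
    anchor, non_anchor = [], []
    for r in range(height):
        for c in range(width):
            if (r + c) % 2 == 0:
                anchor.append((r, c))
            else:
                non_anchor.append((r, c))
    return anchor, non_anchor


# Slices are coded one after another, so slice i can use slices 0..i-1
# as channel-wise context. Within a slice, one checkerboard half is coded
# first and then serves as spatial context for the other half.
slices = slice_channels(num_channels=8, num_slices=4)
anchor, non_anchor = checkerboard(height=4, width=4)
```

Each checkerboard half covers half of the spatial positions, so a slice is decoded in two passes regardless of resolution.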

Abstract

The latent representation in learned image compression encompasses channel-wise, local spatial, and global spatial correlations, which are essential for the entropy model to capture for conditional entropy minimization. Efficiently capturing these contexts within a single entropy model, especially in high-resolution image coding, presents a challenge due to the computational complexity of existing global context modules. To address this challenge, we propose the Linear Complexity Multi-Reference Entropy Model (MEM$^{++}$). Specifically, the latent representation is partitioned into multiple slices. For channel-wise contexts, previously compressed slices serve as the context for compressing a particular slice. For local contexts, we introduce a shifted-window-based checkerboard attention module. This module ensures linear complexity without sacrificing performance. For global contexts, we propose a linear complexity attention mechanism. It captures global correlations by decomposing the softmax operation, enabling the implicit computation of attention maps from previously decoded slices. Using MEM$^{++}$ as the entropy model, we develop the image compression method MLIC$^{++}$. Extensive experimental results demonstrate that MLIC$^{++}$ achieves state-of-the-art performance, reducing BD-rate by $13.39\%$ on the Kodak dataset compared to VTM-17.0 in Peak Signal-to-Noise Ratio (PSNR). Furthermore, MLIC$^{++}$ exhibits linear computational complexity and memory consumption with resolution, making it highly suitable for high-resolution image coding. Code and pre-trained models are available at https://github.com/JiangWeibeta/MLIC. Training dataset is available at https://huggingface.co/datasets/Whiteboat/MLIC-Train-100K.
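The idea of decomposing the softmax so that the attention map is never formed explicitly can be sketched as follows. This is a minimal, stdlib-only illustration of the algebra and not the authors' implementation: it assumes one softmax is applied to the queries over channels and another to the keys over positions, and all function names are hypothetical. Because the two normalizations are separate, the product can be regrouped so that a small $C \times C$ summary is computed first, which avoids the $N \times N$ map.

```python
import math
import random


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(mat):
    """Normalize each row of an N x C matrix (over channels)."""
    return [softmax(row) for row in mat]


def softmax_cols(mat):
    """Normalize each column of an N x C matrix (over positions)."""
    cols = [softmax(list(col)) for col in zip(*mat)]
    return [list(row) for row in zip(*cols)]


def transpose(mat):
    return [list(row) for row in zip(*mat)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quadratic_attention(q, k, v):
    # Forms the explicit N x N map: time and memory grow with N^2.
    attn = matmul(softmax_rows(q), transpose(softmax_cols(k)))
    return matmul(attn, v)


def linear_attention(q, k, v):
    # Forms a C x C summary of keys and values first: cost grows with N.
    context = matmul(transpose(softmax_cols(k)), v)
    return matmul(softmax_rows(q), context)


random.seed(0)
N, C = 6, 4  # N spatial positions, C channels (toy sizes)
q = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
v = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)

# The implicit map that the linear form never materializes.
implicit_map = matmul(softmax_rows(q), transpose(softmax_cols(k)))
row_sums = [sum(row) for row in implicit_map]
```

Both orderings give the same output up to floating-point error; only the grouping of the matrix products differs, which is what makes the cost linear in the number of positions.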

Paper Structure

This paper contains 36 sections, 1 theorem, 18 equations, 15 figures, 6 tables.

Key Result

Theorem 3.1

[jiang2024ecvc] As in standard vanilla attention, each row of the implicit similarity matrix $\mathrm{softmax}_2(\hat{\boldsymbol{y}}_{na,q}^i)\,\mathrm{softmax}_1(\hat{\boldsymbol{y}}_{na,k}^i)^\top$ sums to 1 and represents a normalized attention distribution over all positions.
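The row-sum property follows in one line once the two normalizations are written out. The derivation below is a sketch under the assumption that $\mathrm{softmax}_2$ normalizes each row of the $N \times C$ query matrix over its channels and $\mathrm{softmax}_1$ normalizes each column of the $M \times C$ key matrix over its positions. The shorthand $A$, $B$, $N$, $M$, $C$ is introduced here for illustration and is not the paper's notation.

```latex
\begin{aligned}
A &= \mathrm{softmax}_2(Q) \in \mathbb{R}^{N \times C},
  &\quad \sum_{c=1}^{C} A_{nc} &= 1 \ \text{for every row } n, \\
B &= \mathrm{softmax}_1(K) \in \mathbb{R}^{M \times C},
  &\quad \sum_{m=1}^{M} B_{mc} &= 1 \ \text{for every column } c, \\
\sum_{m=1}^{M} \left(A B^\top\right)_{nm}
  &= \sum_{m=1}^{M} \sum_{c=1}^{C} A_{nc} B_{mc}
   = \sum_{c=1}^{C} A_{nc} \sum_{m=1}^{M} B_{mc}
   = \sum_{c=1}^{C} A_{nc} = 1 .
\end{aligned}
```

Since every entry of $A$ and $B$ is positive, every entry of $A B^\top$ is positive as well, so each row is a valid probability distribution over the $M$ key positions.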

Figures (15)

  • Figure 1: Left: BD-rate versus GPU memory consumption during inference on CLIC Professional Valid [CLIC2020] at 2K resolution. Our MLIC$^{++}$ achieves a better trade-off between performance and GPU memory consumption. Right: Reconstruction comparison on "vita-vilcina-3055" from the CLIC Professional Valid [CLIC2020] dataset. The reconstruction of MLIC$^{++}$ has the best visual quality.
  • Figure 2: Visualization of channels of the latent representation of Kodim19 extracted by Cheng'20 [cheng2020learned] (optimized for MSE, $\lambda=0.0483$) to illustrate channel-wise redundancy. These channels are nearest-neighbor upsampled for visualization.
  • Figure 3: Heatmap of the spatial cosine similarity of the latent representation of Kodim19 extracted by Cheng'20 [cheng2020learned] (optimized for MSE, $\lambda=0.0483$) to visualize global spatial and local spatial redundancy. The heatmap is nearest-neighbor upsampled for visualization.
  • Figure 4: The overall architecture of MLIC$^{++}$. $\downarrow$ means down-sampling. $\uparrow$ means up-sampling. / means stride equals $1$. Red line is the dataflow during decoding. ${\boldsymbol{x}}$ is the input image and $\hat{\boldsymbol x}$ is the reconstructed image. $Q$ is quantization. $AE$ is arithmetic encoding. $AD$ is arithmetic decoding. $\boldsymbol{y}$ is the latent representation and $\hat{\boldsymbol{y}}$ is the quantized latent representation. $\hat{\boldsymbol y}^i$ is the $i$-th slice of $\hat{\boldsymbol{y}}$.
  • Figure 5: Linear Multi-Reference Entropy Model MEM$^{++}$. The figure illustrates the process of decoding a slice $\hat{\boldsymbol{y}}^i$.
  • ...and 10 more figures

Theorems & Definitions (2)

  • Theorem 3.1
  • proof