LLIC: Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression

Wei Jiang, Peirong Ning, Jiayu Yang, Yongqi Zhai, Feng Gao, Ronggang Wang

TL;DR

This work addresses the limited receptive field and rigidity of weights in learned image compression transforms by introducing Large Receptive Field Transform Coding with Adaptive Weights (LLIC). It combines Spatial Transform Blocks with large depthwise kernels (11×11/9×9) and self-conditioned weight generation (SCST) to enlarge the effective receptive field, and Channel Transform Blocks with self-conditioned channel factors (SCCT) to adaptively allocate bits across channels. The proposed STB/CTB framework, augmented by nonlinear DepthRB, a gate mechanism, and an improved two-stage training strategy using large patches, yields substantial BD-Rate reductions on Kodak relative to VTM-17.0 Intra (approximately 9.5–11%), while offering favorable memory and compute characteristics compared to several baselines. Overall, LLIC achieves state-of-the-art rate-distortion performance with better performance/complexity trade-offs, particularly for high-resolution images, indicating strong practical potential for learned image compression.

Abstract

The effective receptive field (ERF) plays an important role in transform coding: it determines how much redundancy can be removed during the transform and how many spatial priors can be utilized to synthesize textures during the inverse transform. Existing methods rely on stacks of small kernels, whose ERFs remain insufficiently large, or on heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large kernel-based depth-wise convolutions to reduce more redundancy while maintaining modest complexity. Due to the wide range of image diversity, we further propose a mechanism to augment convolution adaptability through the self-conditioned generation of weights. The large kernels cooperate with non-linear embedding and gate mechanisms for better expressiveness and lighter pointwise interactions. Our investigation extends to refined training methods that unlock the full potential of these large kernels. Moreover, to promote more dynamic inter-channel interactions, we introduce an adaptive channel-wise bit allocation strategy that autonomously generates channel importance factors in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model to compare with existing transform methods and obtain models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our proposed LLIC models have significant improvements over the corresponding baselines and reduce the BD-Rate by 9.49%, 9.47%, and 10.94% on Kodak over VTM-17.0 Intra, respectively. Our LLIC models achieve state-of-the-art performance and better trade-offs between performance and complexity.
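
The claim that large depth-wise kernels keep complexity modest can be illustrated with a simple weight count. The sketch below is only an illustration, not the authors' accounting: it assumes a layer with 192 channels (the value of $N$ given in the Figure 2 caption; whether the depth-wise layers actually use this width is an assumption), ignores biases, and uses hypothetical helper names. For stride-1 convolutions the per-pixel multiply-accumulate count equals the weight count, so the same ratio applies to compute.

```python
def depthwise_conv_params(channels, kernel_size):
    """Weights in a depth-wise conv: one K x K filter per channel (no bias)."""
    return channels * kernel_size * kernel_size


def dense_conv_params(in_channels, out_channels, kernel_size):
    """Weights in a standard (dense) conv: every output sees every input."""
    return in_channels * out_channels * kernel_size * kernel_size


N = 192  # assumed channel width, taken from the N=192 in the Figure 2 caption

dw11 = depthwise_conv_params(N, 11)   # 11x11 depth-wise kernel
dw9 = depthwise_conv_params(N, 9)     # 9x9 depth-wise kernel
dense3 = dense_conv_params(N, N, 3)   # ordinary 3x3 conv for comparison

print(dw11, dw9, dense3)  # 23232 15552 331776
```

Under these assumptions, an 11×11 depth-wise layer carries roughly 14 times fewer weights than a dense 3×3 layer of the same width while covering a much larger spatial window, which is the trade-off the abstract alludes to.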

Paper Structure

This paper contains 35 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: BD-Rate versus peak GPU memory consumption during testing on CLIC Pro Valid [CLIC2020] at 2K resolution. Our LLIC-ELIC achieves a better trade-off between performance and GPU memory consumption.
  • Figure 2: Network architecture of our LLIC-STF, LLIC-ELIC, and LLIC-TCM. $g_a$ is the analysis transform. $g_s$ is the synthesis transform. $Q$ is the quantization. $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the estimated mean and scale of the latent $\hat{\boldsymbol{y}}$ for probability estimation. Following the baseline models, the means $\boldsymbol{\mu}$ are subtracted from the latent representation $\boldsymbol{y}$ before quantization and arithmetic encoding (AE), and added back to the decoded residual $Q(\boldsymbol{y} - \boldsymbol{\mu})$ after arithmetic decoding (AD). $N=192, M=320$.
  • Figure 3: Architecture of the proposed basic block. STB is the proposed Spatial Transform Block. CTB is the proposed Channel Transform Block. DepthRB is the depth-wise residual block for non-linear embedding. Gate is the proposed Gate Block. $\boldsymbol{\mathcal{F}}_{in}^{stb}$ is the input of STB. $\boldsymbol{\mathcal{F}}_{in}^{ctb}$ is the input of CTB. In STB, we employ large kernels to capture more spatial contexts, and the kernel size $K$ is set to $11$ or $9$ in our method.
  • Figure 4: PSNR-Bit-Rate curves and Rate saving-PSNR curves of our proposed LLIC-STF and its baseline STF [zou2022the]. The relative rate-saving curves are generated by first interpolating the discrete RD points with a cubic spline and then comparing the bitrates of different models at a fixed PSNR.
  • Figure 5: PSNR-Bit-Rate curves and Rate saving-PSNR curves of our proposed LLIC-ELIC and its baseline ELIC [he2022elic].
  • ...and 5 more figures
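
The mean-subtracted quantization described in the Figure 2 caption can be sketched as a round trip on a toy latent. This is a minimal illustration under stated assumptions: $Q$ is taken to be rounding to the nearest integer (the caption only says "quantization"), the latent and means are short made-up lists rather than real tensors, the arithmetic coder is omitted, and the function names are hypothetical.

```python
def encode_residual(y, mu):
    """Encoder side: quantize the mean-subtracted latent, r = Q(y - mu).

    Q is assumed to be rounding to the nearest integer; Python's round()
    resolves exact .5 ties to the even integer.
    """
    return [round(yi - mi) for yi, mi in zip(y, mu)]


def decode_latent(r, mu):
    """Decoder side: add the means back, y_hat = Q(y - mu) + mu."""
    return [ri + mi for ri, mi in zip(r, mu)]


# Toy values, invented for illustration only.
y = [3.9, -2.6, 0.4]    # latent produced by the analysis transform g_a
mu = [2.5, -1.0, 0.1]   # means estimated by the entropy model

r = encode_residual(y, mu)   # integer symbols an arithmetic coder would code
y_hat = decode_latent(r, mu)  # reconstructed latent fed to g_s

print(r)      # [1, -2, 0]
print(y_hat)  # [3.5, -3.0, 0.1]
```

Because only the integer residual is coded, the reconstruction error per element stays within half a quantization step, and a well-predicted mean pushes the residual toward zero.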