Enhancing Learned Image Compression via Cross Window-based Attention

Priyanka Mudgal, Feng Liu

TL;DR

The paper tackles local redundancy limitations in learned image compression by introducing a CNN-based LIC framework equipped with a feature encoding module and a cross window-based attention module (CWAM). CWAM expands the receptive field through cross-scale interactions, while the feature encoding block enhances representation of challenging image regions; both components are modular and can augment existing architectures. Empirical results on Kodak and CLIC show competitive rate-distortion performance, with MSE-optimized models outperforming several baselines, and ablation studies confirming the benefits of CWAM and feature encoding. The work highlights a practical path to improved LIC that balances performance with potential increases in complexity, and provides code for reproducibility and further optimization.

Abstract

In recent years, learned image compression methods have demonstrated superior rate-distortion performance compared to traditional image compression methods. Recent methods utilize convolutional neural networks (CNN), variational autoencoders (VAE), invertible neural networks (INN), and transformers. Despite their significant contributions, a main drawback of these models is their poor performance in capturing local redundancy. Therefore, to leverage global features along with local redundancy, we propose a CNN-based solution integrated with a feature encoding module. The feature encoding module encodes important features before feeding them to the CNN and then utilizes cross-scale window-based attention, which further captures local redundancy. Cross-scale window-based attention is inspired by the attention mechanism in transformers and effectively enlarges the receptive field. Both the feature encoding module and the cross-scale window-based attention module in our architecture are flexible and can be incorporated into any other network architecture. We evaluate our method on the Kodak and CLIC datasets and demonstrate that our approach is effective and on par with state-of-the-art methods. Our code is publicly available at https://github.com/prmudgal/CWAM_IC_ISVC.

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Visualization of decompressed images of kodim14 from the Kodak dataset. Our method, with the feature encoding module and cross-scale window-based attention, compresses the image effectively, with better PSNR and optimized BPP. Each subtitle shows "Method BPP$\downarrow$/PSNR$\uparrow$".
  • Figure 2: The end-to-end learned image compression architecture of Minnen2018JointAA. The analysis and synthesis transforms $g_a$ and $g_s$ map between image space and a latent space of reduced dimension. The hyperprior analysis and synthesis transforms $h_a$ and $h_s$ capture contextual information. The quantization $Q$, together with the entropy encoding and decoding $EC$ and $ED$, converts the latent vector into a compact binary stream. The context module $c_m$ and the probability distribution $p_{\hat{y}|\hat{z}}$ estimate the distribution of the latent variable $\hat{y}$ conditioned on the side information $\hat{z}$.
  • Figure 3: The architecture of our image compression network, which is based on 9190935. The analysis transform $g_a$ and synthesis transform $g_s$ convert variables from image space ($x$) to latent space ($y$) and from latent space ($\hat{y}$) to image space ($\hat{x}$), respectively. The feature encoding module enhances image features. The encoder and decoder consist of convolutional layers with 5 $\times$ 5 kernels and N channels (set to 320), GDN, and CWAM. IGDN represents the inverse GDN module. EC and ED represent the arithmetic encoder and arithmetic decoder, respectively. $h_a$ and $h_s$ are the hyperprior analysis and synthesis transforms implemented in Minnen et al. ballé2018variational. The residual block comprises 1$\times$1 and 3$\times$3 convolutional layers with CWAM.
  • Figure 4: RD performance on the Kodak dataset, which contains 24 high-quality images (top row), and on the CLIC dataset, which contains 30 high-resolution, high-quality images (bottom row). Our method yields much better performance than state-of-the-art learned methods and traditional image compression standards. Note also that most images in the CLIC dataset are of high resolution, suggesting that our method is robust and promising for compressing high-resolution images.
  • Figure 5: Reconstructed images from the Kodak dataset. Images compressed by our method show better PSNR while maintaining or reducing the BPP in comparison to traditional methods. Subtitles show BPP$\downarrow$/PSNR$\uparrow$.
  • ...and 2 more figures
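
The captions above report results as BPP$\downarrow$/PSNR$\uparrow$. As a reminder of what these two numbers measure, the following is a minimal sketch using the standard definitions; it is not taken from the authors' repository. The function names (`mse`, `psnr`, `bits_per_pixel`) and the flat-list pixel representation are assumptions made for illustration, and 8-bit pixel values (peak 255) are assumed.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB (higher is better).

    max_value is the peak pixel value; 255 assumes 8-bit images.
    """
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical images: distortion is zero
    return 10.0 * math.log10(max_value ** 2 / err)


def bits_per_pixel(num_bytes, width, height):
    """Rate in bits per pixel (lower is better) for a compressed file size."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return 8.0 * num_bytes / (width * height)


if __name__ == "__main__":
    # Hypothetical values: a 768x512 image (the Kodak resolution)
    # compressed to 49152 bytes corresponds to exactly 1 bit per pixel.
    print(bits_per_pixel(49152, 768, 512))
    # Every pixel off by one gives an MSE of 1.
    print(psnr([10, 20, 30], [11, 21, 31]))
```

A rate-distortion curve such as those in Figure 4 plots one (BPP, PSNR) point per trained model or quality setting, typically averaged over the images in the dataset.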