Window-based Channel Attention for Wavelet-enhanced Learned Image Compression

Heng Xu, Bowen Hai, Yushun Tang, Zhihai He

TL;DR

This work tackles receptive-field limitations in learned image compression (LIC) by introducing a Space-Channel Hybrid (SCH) framework that combines local spatial modeling with global channel-wise attention. The key innovations are a window-based channel attention module, which enlarges receptive fields by applying channel attention within non-overlapping windows, and a Haar Discrete Wavelet Transform (DWT) module that provides parameter-free, frequency-aware down-sampling to capture more global information. Empirical results on Kodak, Tecnick, and two CLIC datasets show state-of-the-art BD-rate reductions of up to $24.71\%$ relative to VTM-23.1, with PSNR gains of up to $\sim$0.31 dB, while maintaining competitive encoding/decoding efficiency. The results indicate that combining window-based channel attention with frequency-domain down-sampling yields substantial gains in LIC performance, and they suggest avenues for further optimization and mobile deployment through model compression.

Abstract

Learned Image Compression (LIC) models have achieved better rate-distortion performance than traditional codecs. Existing LIC models use CNN, Transformer, or mixed CNN-Transformer blocks as their basic building blocks. However, limited by shifted-window attention, Swin-Transformer-based LIC exhibits restricted growth of receptive fields, which impairs its ability to model large objects. To address this issue and improve performance, we incorporate window partitioning into channel attention for the first time to obtain large receptive fields and capture more global information. Since channel attention hinders the learning of local information, we extend the attention mechanisms in existing Transformer codecs to space-channel attention, establishing multiple receptive fields that capture global correlations with large receptive fields while preserving detailed characterization of local correlations with small receptive fields. We also incorporate the discrete wavelet transform into our Space-Channel Hybrid (SCH) framework for efficient frequency-dependent down-sampling and further enlargement of receptive fields. Experimental results demonstrate that our method achieves state-of-the-art performance, reducing BD-rate by 18.54%, 23.98%, 22.33%, and 24.71% on four standard datasets compared to VTM-23.1.
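
To make the idea of parameter-free, frequency-dependent down-sampling concrete, the following is a minimal NumPy sketch of a single-level 2D Haar DWT and its inverse. It is an illustration only, not the authors' implementation: the function names `haar_dwt2` and `haar_idwt2`, the `(C, H, W)` layout, the stacking of the four sub-bands along the channel axis, and the orthonormal `1/2` scaling are all assumptions made here.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an array shaped (C, H, W), H and W even.

    Returns an array shaped (4*C, H/2, W/2): the low-pass band followed by
    three detail bands. There are no learned parameters.
    """
    x = np.asarray(x, dtype=np.float64)
    a = x[:, 0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0   # local average (low frequency)
    det1 = (a + b - c - d) / 2.0  # difference between rows
    det2 = (a - b + c - d) / 2.0  # difference between columns
    det3 = (a - b - c + d) / 2.0  # diagonal difference
    return np.concatenate([low, det1, det2, det3], axis=0)

def haar_idwt2(y):
    """Inverse of haar_dwt2: (4*C, H/2, W/2) -> (C, H, W)."""
    low, det1, det2, det3 = np.split(np.asarray(y, dtype=np.float64), 4, axis=0)
    c, h, w = low.shape
    x = np.empty((c, 2 * h, 2 * w), dtype=np.float64)
    x[:, 0::2, 0::2] = (low + det1 + det2 + det3) / 2.0
    x[:, 0::2, 1::2] = (low + det1 - det2 - det3) / 2.0
    x[:, 1::2, 0::2] = (low - det1 + det2 - det3) / 2.0
    x[:, 1::2, 1::2] = (low - det1 - det2 + det3) / 2.0
    return x

# Toy input: 3 channels, 8x8 pixels.
x = np.arange(3 * 8 * 8, dtype=np.float64).reshape(3, 8, 8)
y = haar_dwt2(x)
print(y.shape)                         # (12, 4, 4)
print(np.allclose(haar_idwt2(y), x))   # True
```

In this sketch the spatial resolution is halved while all information is retained in the four sub-bands, so the transform is exactly invertible, unlike strided pooling. How the sub-bands are subsequently processed inside the SCH framework is not specified in the text above.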

Paper Structure (25 sections, 6 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Effective Receptive Fields (ERF) on kodim07 from modules of our SCH block. (a) and (b) are from ResNet (He et al., 2016) and Swin Transformer (Liu et al., 2021), while (c) and (d) are our window-based channel attention with and without wavelet transform. Results are normalized and clipped by a threshold of 0.3 for better visualization. The color changes from blue to red as the value increases.
  • Figure 2: The overall architecture of our model. SCH is the Space-Channel Hybrid block. DWT is the discrete wavelet transform and IDWT is its inverse. $\downarrow$ means down-sampling and $\uparrow$ means up-sampling. RB is Residual Block. RBS is Residual Block with Stride. RBU is Residual Block Up-sampling. RE is Range Encoder and RD is Range Decoder.
  • Figure 3: The proposed SCH block (a, left), window-based space attention module (b, middle), and window-based channel attention module (c, right). In (b) and (c), SC Transpose is Space-Channel Dimension Transposition. Modules with similar functions are marked in the same colors. We indicate tensor shapes before and after shape-transforming modules, where $n$, $h$, and $w$ are the number of windows, the window height, and the window width, respectively.
  • Figure 4: Demonstration of window-based space attention and channel attention with window size $2 \times 2$ and channel size 5. In each window, (a) performs attention across space tokens and (b) performs attention across channel tokens. Different tokens are marked in different colors. The depth of the token is the actual channel size for computation.
  • Figure 5: Channel attention maps on kodim07 from our module and DaViT (Ding et al., 2022). Our window-based channel attention yields a $C\times C$ map for each of the $n$ windows and $H$ heads, and we randomly select three maps, (a), (b), and (c), from different windows of the first head. DaViT yields a $C_g\times C_g$ map for each of its $H$ heads, where $C=H\times C_g$, and we visualize three maps, (d), (e), and (f), from three heads. $n$, $H$, and $C$ are 96, 8, and 128, respectively.
  • ...and 4 more figures
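
The distinction drawn in the Figure 4 and Figure 5 captions, attention across space tokens versus attention across channel tokens inside each window, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's module: the helper names `window_partition`, `softmax`, and `attend` are hypothetical, the query/key/value projections are replaced by the identity (a real module would learn them), and the way heads split the token dimension is assumed.

```python
import numpy as np

def window_partition(x, h, w):
    """(C, H, W) -> (n, C, h*w) with n = (H/h)*(W/w) non-overlapping windows."""
    C, H, W = x.shape
    x = x.reshape(C, H // h, h, W // w, w)
    x = x.transpose(1, 3, 0, 2, 4)          # (H/h, W/w, C, h, w)
    return x.reshape(-1, C, h * w)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens, heads):
    """Multi-head self-attention over tokens shaped (n, T, D).

    Assumption: identity Q/K/V projections, so q = k = v = tokens.
    Returns the output (n, T, D) and attention maps (n, heads, T, T).
    """
    n, T, D = tokens.shape
    assert D % heads == 0, "token dimension must be divisible by heads"
    dh = D // heads
    t = tokens.reshape(n, T, heads, dh).transpose(0, 2, 1, 3)  # (n, heads, T, dh)
    attn = softmax(t @ t.transpose(0, 1, 3, 2) / np.sqrt(dh))  # (n, heads, T, T)
    out = (attn @ t).transpose(0, 2, 1, 3).reshape(n, T, D)
    return out, attn

# Toy setting mirroring Figure 4: window size 2x2, channel size 5.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4, 6))          # (C, H, W)
win = window_partition(x, 2, 2)             # (6, 5, 4): 6 windows

# Space attention: the h*w positions in a window are the tokens.
space_out, space_attn = attend(win.transpose(0, 2, 1), heads=1)
# Channel attention: the C channels in a window are the tokens
# (the space-channel transposition swaps the roles of the two axes).
chan_out, chan_attn = attend(win, heads=2)

print(space_attn.shape)  # (6, 1, 4, 4)
print(chan_attn.shape)   # (6, 2, 5, 5)
```

With the values quoted in the Figure 5 caption ($n=96$, $H=8$, $C=128$), the channel attention maps in this sketch would have shape `(96, 8, 128, 128)`, which matches the caption's description of one $C\times C$ map per window and head.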