Bi-Level Spatial and Channel-aware Transformer for Learned Image Compression

Hamidreza Soltani, Erfan Ghasemi

TL;DR

The paper tackles inefficiencies in learned image compression by incorporating frequency-aware processing into a Transformer-based transform stage. It introduces the Hybrid Spatial-Channel Attention Transformer Block (HSCATB), which combines Spatial-aware Self-Attention (SaSA), Channel-aware Self-Attention (CaSA), and the Mixed Local-Global FFN (MLGFFN) to better decorrelate latent representations. A channel-wise entropy model with uneven channel chunking and differentiable quantization are employed to tighten bitrate estimation and enable end-to-end training. Empirical results on Kodak demonstrate state-of-the-art rate-distortion performance, outperforming both traditional codecs and contemporary LIC methods, with clear gains attributed to the frequency-aware attention and multi-scale FFN design.

Abstract

Recent advancements in learned image compression (LIC) methods have demonstrated superior performance over traditional hand-crafted codecs. These learning-based methods often employ convolutional neural networks (CNNs) or Transformer-based architectures. However, these nonlinear approaches frequently overlook the frequency characteristics of images, which limits their compression efficiency. To address this issue, we propose a novel Transformer-based image compression method that enhances the transformation stage by considering frequency components within the feature map. Our method integrates a novel Hybrid Spatial-Channel Attention Transformer Block (HSCATB), where a spatial-based branch independently handles high and low frequencies at the attention layer, and a Channel-aware Self-Attention (CaSA) module captures information across channels, significantly improving compression performance. Additionally, we introduce a Mixed Local-Global Feed Forward Network (MLGFFN) within the Transformer block to enhance the extraction of diverse and rich information, which is crucial for effective compression. These innovations collectively improve the transformation's ability to project data into a more decorrelated latent space, thereby boosting overall compression efficiency. Experimental results demonstrate that our framework surpasses state-of-the-art LIC methods in rate-distortion performance.
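
To make the idea of attention "across channels" concrete, the following is a minimal NumPy sketch of generic channel-wise self-attention. It is not the authors' CaSA implementation: the abstract does not specify the layer's details, so the projection matrices `w_q`, `w_k`, `w_v`, the L2 normalization, and the function name `channel_self_attention` are all assumptions made for illustration. The point it shows is that each channel is treated as one token, so the attention map is C x C rather than (HW) x (HW).

```python
import numpy as np

def channel_self_attention(x, w_q, w_k, w_v):
    """Generic channel-wise self-attention (illustrative, not the paper's CaSA).

    x:   feature map of shape (C, H, W)
    w_*: hypothetical (C, C) projection matrices that mix channels
    Returns the attended feature map (C, H, W) and the (C, C) attention map.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)           # each channel becomes one token of length H*W

    q = w_q @ tokens                       # (C, HW)
    k = w_k @ tokens                       # (C, HW)
    v = w_v @ tokens                       # (C, HW)

    # Assumption: L2-normalize along the spatial axis so logits are cosine similarities.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    logits = q @ k.T                       # (C, C): similarity between channels
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)          # row-wise softmax

    out = attn @ v                         # (C, HW): each channel is a mix of all channels
    return out.reshape(c, h, w), attn

# Usage with random data (shapes chosen arbitrarily for the example)
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = channel_self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Because the attention map scales with the number of channels rather than the number of pixels, this form of attention stays cheap on high-resolution feature maps, which is one common motivation for pairing it with a spatial branch.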

Paper Structure (20 sections, 12 equations, 3 figures, 2 tables)

This paper contains 20 sections, 12 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Left: Overall framework of the proposed neural image compression model. The model uses multiple downsample blocks, upsample blocks, and Hybrid Spatial-Channel Attention Transformer Blocks (HSCATB) to build the nonlinear transforms (i.e., the analysis transform $g_a(\cdot)$ and the synthesis transform $g_s(\cdot)$). Right: Architectures of the downsample and upsample blocks.
  • Figure 2: (a) Spatial-aware Self-Attention (SaSA), which is composed of low-frequency and high-frequency paths. (b) Channel-aware Self-Attention (CaSA). (c) Hybrid Spatial-Channel Attention Transformer Block (HSCATB). (d) Mixed Local-Global Feed Forward Network (MLGFFN).
  • Figure 3: Rate-distortion performance assessed using the Kodak dataset.