Frequency-Aware Transformer for Learned Image Compression

Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong

TL;DR

This work tackles redundancy in the latent representations of learned image compression (LIC) by introducing a frequency-aware transformer (FAT) block for the nonlinear transforms. The FAT block combines Frequency-Decomposition Window Attention (FDWA), which captures multiscale and directional frequency components, with a Frequency-Modulation FFN (FMFFN), which adaptively modulates those components. A Transformer-based Channel-wise Autoregressive (T-CA) entropy model exploits channel dependencies for distribution estimation. The method achieves state-of-the-art rate-distortion performance among LIC methods, with BD-rate reductions of 14.5%, 15.1%, and 13.0% relative to VTM-12.1 on the Kodak, Tecnick, and CLIC datasets, respectively.

Abstract

Learned image compression (LIC) has gained traction as an effective solution for image storage and transmission in recent years. However, existing LIC methods are redundant in latent representation due to limitations in capturing anisotropic frequency components and preserving directional details. To overcome these challenges, we propose a novel frequency-aware transformer (FAT) block that, for the first time, achieves multiscale directional analysis for LIC. The FAT block comprises frequency-decomposition window attention (FDWA) modules to capture multiscale and directional frequency components of natural images. Additionally, we introduce a frequency-modulation feed-forward network (FMFFN) to adaptively modulate different frequency components, improving rate-distortion performance. Furthermore, we present a transformer-based channel-wise autoregressive (T-CA) model that effectively exploits channel dependencies. Experiments show that our method achieves state-of-the-art rate-distortion performance compared to existing LIC methods, and outperforms the latest standardized codec, VTM-12.1, by 14.5%, 15.1%, and 13.0% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively.
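
The BD-rate figures quoted in the abstract summarize the average bitrate difference between two codecs at equal quality. As a rough illustration, the sketch below follows the commonly used Bjøntegaard procedure (cubic fit of log-rate against PSNR, integrated over the overlapping PSNR range). It is not the authors' evaluation script, which the text does not give; the function name `bd_rate` and all rate/PSNR values are hypothetical and chosen only to make the arithmetic easy to check.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) of a test codec relative to an anchor.

    Negative values mean the test codec needs fewer bits at equal PSNR.
    Assumes at least four rate-distortion points per codec.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    psnr_a = np.asarray(psnr_anchor, dtype=float)
    psnr_t = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    poly_a = np.polyfit(psnr_a, log_ra, 3)
    poly_t = np.polyfit(psnr_t, log_rt, 3)

    # Integrate both fits over the PSNR interval covered by both curves.
    lo = max(psnr_a.min(), psnr_t.min())
    hi = min(psnr_a.max(), psnr_t.max())
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    # Average log-rate gap, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Synthetic example (not data from the paper): the test codec uses
# 15% fewer bits than the anchor at every quality level.
anchor_bpp = [0.2, 0.4, 0.6, 0.9]
anchor_psnr = [30.0, 33.0, 35.0, 37.0]
test_bpp = [0.85 * r for r in anchor_bpp]
print(f"BD-rate: {bd_rate(anchor_bpp, anchor_psnr, test_bpp, anchor_psnr):.2f}%")
```

Because the synthetic test curve is the anchor curve shifted by a constant factor in rate, the result is close to -15%, which makes the sign convention easy to verify: a negative BD-rate is a saving.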

Paper Structure

This paper contains 28 sections, 6 equations, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: Illustration of the proposed frequency-decomposition window attention (FDWA), which realizes multiscale and directional decomposition. The first column shows the diverse window shapes used to capture different frequency components, the second column visualizes the extracted features, and the third column presents the Fourier spectrum of the features.
  • Figure 2: Overview of the proposed Frequency-aware Transformer-based learned Image Compression (FTIC) model. Residual blocks with stride (RBS), residual blocks with upsampling (RBU), and frequency-aware transformer (FAT) blocks make up the nonlinear transforms (i.e., the analysis transform $g_a(\cdot)$ and the synthesis transform $g_s(\cdot)$). In each pair of consecutive FAT blocks, the first uses regular frequency-decomposition window attention (FDWA) and the second applies the shifted-window operation.
  • Figure 3: Proposed Transformer-based Channel-wise Autoregressive (T-CA) entropy model, including the hyperprior path. For brevity, each slice is assumed to have 3 channels in (a). GConv $n\times n$ denotes a group convolution with kernel size $n\times n$.
  • Figure 4: R-D performance evaluated on the Kodak dataset. The compared methods include state-of-the-art LIC models and handcrafted image codecs. Left: PSNR; right: MS-SSIM.
  • Figure 5: Frequency intensity (16 × 16) of the FDWA output at the last FAT block for both (a) the analysis transform $g_a(\cdot)$ and (b) the synthesis transform $g_s(\cdot)$. The model is trained with $\lambda = 0.0483$ and MSE as the distortion metric. We show 6 output channels for each of LL-WA, HH-WA, HL-WA, and LH-WA. The magnitude values are averaged over 100 samples. Lighter colors indicate larger magnitudes, and pixels closer to the center represent lower frequencies.
  • ...and 8 more figures
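
The captions of Figures 1 and 5 describe two operations that are easy to sketch: splitting a feature map into windows of different shapes, and averaging a centered Fourier magnitude over samples. The NumPy sketch below illustrates both under stated assumptions. The function names, the 16 × 16 map, and the specific window sizes are hypothetical and are not taken from the paper, and the attention computation inside each window is omitted.

```python
import numpy as np

def window_partition(x, win_h, win_w):
    """Split an (H, W, C) feature map into non-overlapping win_h x win_w windows.

    Returns an array of shape (num_windows, win_h * win_w, C), i.e. one
    token sequence per window, as window attention would consume it.
    """
    H, W, C = x.shape
    assert H % win_h == 0 and W % win_w == 0, "map must tile evenly"
    x = x.reshape(H // win_h, win_h, W // win_w, win_w, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes first
    return x.reshape(-1, win_h * win_w, C)

def mean_centered_spectrum(features):
    """Average Fourier magnitude over samples, with zero frequency at the center.

    features: array of shape (N, H, W), one channel from N samples.
    Returns an (H, W) array; entries near the center are low frequencies.
    """
    spec = np.fft.fft2(features, axes=(-2, -1))
    spec = np.fft.fftshift(spec, axes=(-2, -1))
    return np.abs(spec).mean(axis=0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 8))

# Illustrative window shapes (assumed): small square, large square,
# wide horizontal strip, and tall vertical strip.
for win_h, win_w in [(4, 4), (16, 16), (4, 16), (16, 4)]:
    tokens = window_partition(feat, win_h, win_w)
    print((win_h, win_w), "->", tokens.shape)

# Averaged 16 x 16 spectrum for one channel over 100 random samples.
samples = rng.standard_normal((100, 16, 16))
print(mean_centered_spectrum(samples).shape)
```

Square windows of different sizes change the scale of the context each token attends to, while elongated windows favor one spatial direction, which is the intuition behind pairing window shape with frequency content in Figure 1. The `fftshift` call is what places low frequencies at the center of the map, matching the reading convention stated in the Figure 5 caption.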