Linear Attention Modeling for Learned Image Compression

Donghui Feng, Zhengxue Cheng, Shen Wang, Ronghua Wu, Hongwei Hu, Guo Lu, Li Song

TL;DR

The paper addresses the computational burden of learned image compression by introducing LALIC, a linear-attention LIC architecture that leverages Bi-RWKV transform blocks with Spatial-Mix and Channel-Mix, augmented by an Omni-Shift layer to handle 2D latent representations. It introduces RWKV-SCCTX for entropy modeling, effectively capturing spatial and channel dependencies. Empirically, LALIC achieves competitive rate-distortion performance, outperforming VTM-9.1 by substantial BD-rate margins on Kodak, CLIC, and Tecnick while maintaining moderate decoding speed and parameter count. This work demonstrates that linear-attention models can match or exceed transformer-based LIC performance with improved efficiency, enabling practical high-resolution image compression.

Abstract

In recent years, learned image compression has made tremendous progress and achieved impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transforms and learnable entropy modeling. However, most studies focus on a strong backbone, and few consider a low-complexity design. In this paper, we propose LALIC, a linear attention modeling approach for learned image compression. Specifically, we propose to use Bi-RWKV blocks, utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv-based Omni-Shift module to adapt to two-dimensional latent representations. Furthermore, we propose an RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX) that leverages Bi-RWKV to effectively model the correlation between neighboring features. To our knowledge, our work is the first to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performance, outperforming VTM-9.1 by -15.26%, -15.41%, and -17.63% in BD-rate on the Kodak, CLIC, and Tecnick datasets, respectively. The code is available at https://github.com/sjtu-medialab/RwkvCompress.
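The BD-rate figures quoted above summarize the average bitrate change at equal quality relative to an anchor codec, so a negative value means bitrate savings. As a rough illustration of how such a number is typically obtained, the sketch below follows the classic Bjøntegaard procedure: fit a cubic to log-rate as a function of PSNR for each codec, integrate both fits over the shared PSNR range, and convert the mean log-rate gap to a percentage. It assumes exactly four rate-distortion points per curve (so the cubic fit is the exact interpolant), and the function names and sample numbers are hypothetical, not taken from the paper or its repository.

```python
import math

def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial interpolating the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def integrate_cubic(xs, ys, lo, hi):
    """Integrate the cubic interpolant over [lo, hi].

    Simpson's rule is exact for polynomials up to degree three,
    so no numerical approximation error is introduced here.
    """
    mid = 0.5 * (lo + hi)
    f_lo = lagrange_eval(xs, ys, lo)
    f_mid = lagrange_eval(xs, ys, mid)
    f_hi = lagrange_eval(xs, ys, hi)
    return (hi - lo) / 6.0 * (f_lo + 4.0 * f_mid + f_hi)

def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Bjontegaard delta rate in percent (negative = test codec saves bits).

    Assumption: four (rate, PSNR) points per codec, rates in bpp, PSNR in dB.
    """
    series = (anchor_rates, anchor_psnr, test_rates, test_psnr)
    if any(len(s) != 4 for s in series):
        raise ValueError("this sketch expects exactly four RD points per curve")

    # Work in the log-rate domain, with PSNR as the independent variable.
    log_anchor = [math.log(r) for r in anchor_rates]
    log_test = [math.log(r) for r in test_rates]

    # Only compare the curves where their quality ranges overlap.
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    if hi <= lo:
        raise ValueError("the two curves have no overlapping PSNR range")

    area_anchor = integrate_cubic(anchor_psnr, log_anchor, lo, hi)
    area_test = integrate_cubic(test_psnr, log_test, lo, hi)
    avg_log_diff = (area_test - area_anchor) / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical RD points: the test codec needs 15% fewer bits at every PSNR.
anchor_rates = [0.2, 0.4, 0.8, 1.6]
anchor_psnr = [30.0, 33.0, 36.0, 39.0]
test_rates = [0.85 * r for r in anchor_rates]
test_psnr = list(anchor_psnr)

print(f"BD-rate: {bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):.2f}%")
```

With the made-up points above the result is about -15%, because the test curve sits at a constant 0.85x rate offset from the anchor. Real evaluations use measured rate-distortion points per dataset, and tools differ in the interpolation they apply, so exact values may vary slightly between implementations.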

Paper Structure

This paper contains 21 sections, 8 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: BD-rate vs. decoding latency on the Kodak dataset, where our proposed LALIC achieves competitive BD-rate with moderate complexity. Upper left is better.
  • Figure 2: (a) Overview of the proposed Linear Attention based Learned Image Compression (LALIC). Conv$(N,2)\downarrow$ and Deconv$(N,2)\uparrow$ represent strided down-convolution and strided up-convolution with $N \times N$ filters, respectively. $L$ identical Bi-RWKV Blocks are stacked after each downsampling or upsampling conv layer. AE, AD, and Q represent Arithmetic Encoding, Arithmetic Decoding, and Quantization. RWKV-SCCTX is the proposed RWKV-based Spatial-Channel Context model, illustrated in Fig. 4. (b) Details of the Bi-RWKV Block. Omni-Shift [Yang.2024.Restore-RWKV] denotes a reparameterized $5 \times 5$ depthwise convolution that captures local context, and BiWKV is the bidirectional attention proposed in [Duan.2024.Vision-RWKV].
  • Figure 3: Effective receptive field (ERF) [Luo.2016.ERF] visualization for the forward pass ($g_a$ & $h_a$) of different models. A more extensively distributed dark area indicates a larger ERF.
  • Figure 4: Diagram of the RWKV Spatial-Channel Context Model.
  • Figure 5: Rate-distortion performance on the Kodak dataset.
  • ...and 8 more figures
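
The Figure 2 caption describes Omni-Shift as a reparameterized 5x5 depthwise convolution. The sketch below illustrates the general idea of such structural reparameterization on a single channel (a depthwise convolution treats each channel independently): several parallel branches are folded into one 5x5 kernel that produces the same output. The four-branch layout (identity, 1x1, 3x3, 5x5, each scaled by a weight in `alphas`) is an assumption made for illustration, and all names and values are hypothetical rather than taken from the LALIC code.

```python
import random

def conv2d_same(x, kernel):
    """Single-channel cross-correlation with zero padding ('same' output size)."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - r, j + dj - r
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out

def pad_to_5x5(kernel):
    """Center a 1x1, 3x3, or 5x5 kernel inside a zero-filled 5x5 kernel."""
    k = len(kernel)
    off = (5 - k) // 2
    out = [[0.0] * 5 for _ in range(5)]
    for i in range(k):
        for j in range(k):
            out[i + off][j + off] = kernel[i][j]
    return out

def merge_branches(alphas, k1, k3, k5):
    """Fold the weighted branches into one 5x5 kernel (inference-time form)."""
    identity = [[1.0]]  # the identity branch is a 1x1 kernel with weight one
    merged = [[0.0] * 5 for _ in range(5)]
    for alpha, ker in zip(alphas, (identity, k1, k3, k5)):
        padded = pad_to_5x5(ker)
        for i in range(5):
            for j in range(5):
                merged[i][j] += alpha * padded[i][j]
    return merged

def multi_branch(x, alphas, k1, k3, k5):
    """Training-time form: run every branch separately and sum the results."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for alpha, ker in zip(alphas, ([[1.0]], k1, k3, k5)):
        y = conv2d_same(x, ker)
        for i in range(h):
            for j in range(w):
                out[i][j] += alpha * y[i][j]
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical input and weights, seeded for reproducibility.
rng = random.Random(0)
x = rand_matrix(rng, 6, 7)
k1, k3, k5 = rand_matrix(rng, 1, 1), rand_matrix(rng, 3, 3), rand_matrix(rng, 5, 5)
alphas = [rng.uniform(-1.0, 1.0) for _ in range(4)]

y_branches = multi_branch(x, alphas, k1, k3, k5)
y_merged = conv2d_same(x, merge_branches(alphas, k1, k3, k5))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(y_branches, y_merged)
               for a, b in zip(row_a, row_b))
print("max |branches - merged| =", max_diff)
```

The two outputs agree up to floating-point rounding because convolution is linear in the kernel: zero-padding a small kernel to 5x5 does not change its response, and a weighted sum of kernels gives the weighted sum of responses. This is why a multi-branch layer of this kind can be collapsed into a single 5x5 depthwise convolution at inference time.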