CMamba: Learned Image Compression with State Space Models

Zhuojie Wu, Heming Du, Shuyun Wang, Ming Lu, Haiyang Sun, Yandong Guo, Xin Yu

TL;DR

CMamba tackles the trade-off between rate-distortion performance and computational efficiency in Learned Image Compression by marrying CNNs with State Space Models. It introduces a Content-Adaptive SSM (CA-SSM) that dynamically fuses global content from SSMs with local details from CNNs, and a Context-Aware Entropy (CAE) module that jointly models spatial and channel dependencies to optimize entropy coding. Empirical results on Kodak, Tecnick, and CLIC show BD-Rate reductions relative to VVC and state-of-the-art LIC methods, alongside substantial reductions in parameters, FLOPs, and decoding time on Kodak. This hybrid approach demonstrates that selective scanning and autoregressive channel modeling can achieve practical, scalable compression without sacrificing quality.
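The "dynamic fusion" of a global path and a local path can be pictured with a generic gated blend. This is only a minimal sketch: the summary above does not specify how CA-SSM computes its fusion weights, so the sigmoid gate, the parameters `w_g`, `w_l`, and `bias`, and the toy feature values are all assumptions made for illustration, not the authors' design.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fuse(global_feat, local_feat, w_g=1.0, w_l=1.0, bias=0.0):
    """Blend two feature vectors with a per-element, content-dependent gate.

    w_g, w_l and bias are hypothetical gate parameters; in a trained model
    they would be learned rather than fixed by hand.
    """
    fused = []
    for g, l in zip(global_feat, local_feat):
        gate = sigmoid(w_g * g + w_l * l + bias)  # weight in (0, 1)
        fused.append(gate * g + (1.0 - gate) * l)  # convex combination
    return fused


# Toy stand-ins for the SSM (global) and CNN (local) path outputs.
global_feat = [0.8, 0.1, -0.3, 0.5]
local_feat = [0.2, 0.9, 0.4, -0.6]
fused = fuse(global_feat, local_feat)
print(fused)
```

Because the gate depends on the features themselves, each position can lean toward the global or the local path, which is the sense in which such a fusion is content-adaptive.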

Abstract

Learned Image Compression (LIC) has explored various architectures, such as Convolutional Neural Networks (CNNs) and transformers, for modeling image content distributions to compress images effectively. However, achieving high rate-distortion performance while maintaining low computational complexity (i.e., parameters, FLOPs, and latency) remains challenging. In this paper, we propose a hybrid Convolution and State Space Models (SSMs) based image compression framework, termed CMamba, to achieve superior rate-distortion performance with low computational complexity. Specifically, CMamba introduces two key components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module. First, we observe that SSMs excel at modeling overall content but tend to lose high-frequency details. In contrast, CNNs are proficient at capturing local details. Motivated by this, we propose the CA-SSM module, which dynamically fuses global content extracted by SSM blocks and local details captured by CNN blocks in both the encoding and decoding stages. As a result, important image content is well preserved during compression. Second, our proposed CAE module is designed to reduce spatial and channel redundancies in latent representations after encoding. Specifically, our CAE leverages SSMs to parameterize the spatial content in latent representations. Benefiting from SSMs, CAE significantly improves spatial compression efficiency while reducing spatial content redundancies. Moreover, along the channel dimension, CAE reduces inter-channel redundancies of latent representations in an autoregressive manner, which can fully exploit prior knowledge from previous channels without sacrificing efficiency. Experimental results demonstrate that CMamba achieves superior rate-distortion performance.
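To illustrate why conditioning on previously coded channels can lower the bitrate, the sketch below estimates the cost of coding quantized latents under a discretized Gaussian, with and without context from the previous channel slice. It is a toy under stated assumptions: the predictor simply reuses the previous slice as the mean, `sigma` is fixed, and the latent values are made up. In the paper the entropy parameters come from the learned CAE module, which is not reproduced here.

```python
import math


def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bits(symbol, mu, sigma):
    """Estimated bits for an integer symbol under a discretized Gaussian."""
    p = gaussian_cdf(symbol + 0.5, mu, sigma) - gaussian_cdf(symbol - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log(0)


def code_slices(slices, use_context, sigma=1.0):
    """Sum the estimated bits over channel slices, coded in order.

    With use_context=True, each slice after the first uses the previously
    decoded slice as its predicted mean (a hypothetical toy predictor).
    """
    total = 0.0
    decoded = []
    for current in slices:
        if use_context and decoded:
            means = decoded[-1]
        else:
            means = [0.0] * len(current)
        total += sum(bits(y, m, sigma) for y, m in zip(current, means))
        decoded.append(current)  # now available as context for later slices
    return total


# Made-up quantized latents: three strongly correlated channel slices.
slices = [[4, -3, 2, 5], [4, -3, 2, 5], [5, -3, 2, 4]]
independent_bits = code_slices(slices, use_context=False)
contextual_bits = code_slices(slices, use_context=True)
print(independent_bits, contextual_bits)
```

The first slice costs the same in both settings because it has no context. The saving comes from later slices, whose residuals relative to the prediction are small and therefore cheap to code.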

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The Fourier spectrum comparisons between SSMs and CNNs. (a) The Fourier spectrum of features obtained from the SSM-based method and the CNN-based method ChARM [minnen2020channel] in the last block of the analysis transform $g_a(\cdot)$. (b) Relative log amplitudes of Fourier-transformed feature maps for different methods. $\Delta$ log amplitude values indicate the averaged output of each block in $g_a(\cdot)$ on the Kodak dataset.
  • Figure 2: (a) Overview of our proposed method. (b) Detailed design of our proposed Content-Adaptive SSM (CA-SSM) module. The CA-SSM module has two parallel paths (i.e., VSS block and ResBlock) to capture global content and local details, and then fuses these features dynamically. (c) The detailed network architecture of our Context-Aware Entropy (CAE) module. The CAE module jointly models spatial and channel dependencies in latent representations $y$.
  • Figure 3: PSNR-Bitrate curves evaluated on Kodak, Tecnick, and CLIC datasets. The compared methods include state-of-the-art LIC models and handcrafted codecs. LIC models are optimized with MSE.
  • Figure 4: Visual comparison of the decompressed kodim24.png image from the Kodak dataset using various compression methods. Opt.MSE and Opt.MS-SSIM indicate that a model is optimized with MSE and MS-SSIM, respectively. More visual comparisons are provided in the supplementary materials.
  • Figure 5: Rate-distortion performance evaluated on the Kodak dataset. All the models are optimized with MS-SSIM.
  • ...and 1 more figure
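The relative log amplitude in Figure 1(b) can be made concrete with a small example. The footnotes that define the exact procedure are not included in this listing, so the sketch below assumes one common convention: the log amplitude at the highest spatial frequency minus the log amplitude at zero frequency. The function names and the 8×8 toy feature maps are illustrative, not taken from the paper.

```python
import cmath
import math


def dft_coefficient(x, u, v):
    """Single 2D DFT coefficient of a small map at frequency index (u, v)."""
    rows, cols = len(x), len(x[0])
    total = 0j
    for r in range(rows):
        for c in range(cols):
            angle = -2.0 * math.pi * (u * r / rows + v * c / cols)
            total += x[r][c] * cmath.exp(1j * angle)
    return total


def delta_log_amplitude(x, eps=1e-8):
    """Log amplitude at the highest frequency minus log amplitude at DC.

    Assumed convention; eps keeps the logarithm finite for zero amplitudes.
    """
    rows, cols = len(x), len(x[0])
    dc = abs(dft_coefficient(x, 0, 0))
    highest = abs(dft_coefficient(x, rows // 2, cols // 2))
    return math.log(highest + eps) - math.log(dc + eps)


N = 8
# Smooth map: a constant plus one low-frequency cosine.
smooth = [[1.0 + 0.5 * math.cos(2.0 * math.pi * r / N) for c in range(N)]
          for r in range(N)]
# Detailed map: the same content plus a checkerboard (highest frequency).
detailed = [[smooth[r][c] + 0.5 * (-1) ** (r + c) for c in range(N)]
            for r in range(N)]

print(delta_log_amplitude(smooth), delta_log_amplitude(detailed))
```

A strongly negative value means the map carries little high-frequency energy relative to its overall content. Under this convention, a feature extractor that loses fine detail would show lower values than one that preserves it.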