MambaVC: Learned Visual Compression with Selective State Spaces

Shiyu Qin, Jinpeng Wang, Yimin Zhou, Bin Chen, Tianci Luo, Baoyi An, Tao Dai, Shutao Xia, Yaowei Wang

TL;DR

MambaVC introduces the first visual compression network built on selective state-space modeling, using Visual State Space blocks and a 2D Selective Scan to capture global context efficiently. It achieves superior rate-distortion performance with lower computation and memory overhead than CNN- and Transformer-based rivals, with pronounced gains on high-resolution imagery. Extended to video via an SSF backbone, MambaVC-SSF shows competitive results while highlighting the remaining gap to state-of-the-art video codecs. Overall, the work demonstrates the promise of state-space models for scalable, high-quality visual compression and provides a thorough comparative analysis of design choices. The approach offers a principled path toward efficient, high-fidelity compression in real-world applications such as HD medical imaging and satellite data transmission.

Abstract

Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSMs. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements on high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.
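
The 2DSS idea mentioned in the abstract (unfold a 2D feature map into 1D sequences, process each sequence, then merge the results) can be illustrated with a minimal sketch. The following is not the authors' implementation. It assumes the four scan patterns are row-wise and column-wise traversals in forward and backward directions, which the text here does not confirm, and it replaces the selective SSM with a hypothetical placeholder `running_mean`. The names `four_scan_orders` and `scan_and_merge` are also hypothetical.

```python
def four_scan_orders(height, width):
    """Return four traversal orders over an H x W grid as lists of (row, col).

    Assumption: the four patterns are row-major, column-major, and their
    reverses. The source only states that four parallel patterns are used.
    """
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def scan_and_merge(feature, sequence_fn):
    """Unfold a 2D map along each order, apply sequence_fn, and sum the results.

    feature is a list of rows (single channel). sequence_fn maps a 1D list to
    a 1D list of the same length; it stands in for a selective SSM.
    """
    height, width = len(feature), len(feature[0])
    merged = [[0.0] * width for _ in range(height)]
    for order in four_scan_orders(height, width).values():
        sequence = [feature[r][c] for r, c in order]
        processed = sequence_fn(sequence)
        # Scatter each processed value back to its original grid position.
        for (r, c), value in zip(order, processed):
            merged[r][c] += value
    return merged


def running_mean(sequence):
    """Hypothetical causal placeholder: each output depends only on earlier inputs."""
    outputs, total = [], 0.0
    for count, value in enumerate(sequence, start=1):
        total += value
        outputs.append(total / count)
    return outputs


if __name__ == "__main__":
    feature_map = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(scan_and_merge(feature_map, running_mean))
```

Each individual scan is causal, so a position only sees what precedes it in that ordering. Summing the four directions is what lets every output position depend on the whole map, which is the intuition behind capturing global context with 1D sequence models.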

Paper Structure

This paper contains 35 sections, 14 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: (a) MambaVC achieves the best BD-rate with the least computation and memory overhead. See the sections "Variant Visual Compression Performance" and "Computational complexity" for more details. (b) The improvement of MambaVC over other designs becomes more pronounced as resolution increases.
  • Figure 2: (a) Overview of MambaVC. CAM is the channel-wise auto-regressive entropy model [liu2023learned]. FM is the factorized entropy model. $\text{Conv}(N, 2)\downarrow$ and $\text{Deconv}(N, 2)\uparrow$ represent strided down convolution and strided up convolution with $N{\times}N$ filters, respectively. (b) A VSS block consists of several layers. Each layer includes a 2DSS module, which performs selective scans in 4 parallel patterns.
  • Figure 3: Performance evaluation (models optimized for MSE) on the Kodak dataset [franzen1999kodak].
  • Figure 4: Video compression performance evaluation on benchmark datasets.
  • Figure 5: Latent correlation of $(\bm{y} - \bm{\mu}) / \bm{\sigma}$. All models are trained with $\lambda=0.0067$. The value at position $(i,j)$ represents the cross-correlation between spatial locations $(x,y)$ and $(x+i,y+j)$ along the channel dimension, averaged across all images in Kodak [franzen1999kodak].
  • ...and 10 more figures
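
The latent-correlation measure described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes the cross-correlation is a Pearson correlation between the channel vectors at two spatial locations, averaged over all valid locations of one image. The exact normalization is not given here, and the names `pearson`, `normalize_latent`, and `latent_correlation` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def normalize_latent(y, mu, sigma):
    """Compute (y - mu) / sigma element-wise for nested lists shaped [C][H][W]."""
    return [
        [
            [(v - m) / s for v, m, s in zip(y_row, mu_row, sigma_row)]
            for y_row, mu_row, sigma_row in zip(y_ch, mu_ch, sigma_ch)
        ]
        for y_ch, mu_ch, sigma_ch in zip(y, mu, sigma)
    ]


def latent_correlation(z, di, dj):
    """Average channel-wise correlation between locations (x, y) and (x+di, y+dj).

    z is a normalized latent shaped [C][H][W]. Assumption: pairs whose shifted
    location falls outside the map are skipped.
    """
    channels, height, width = len(z), len(z[0]), len(z[0][0])
    values = []
    for r in range(height):
        for c in range(width):
            r2, c2 = r + di, c + dj
            if 0 <= r2 < height and 0 <= c2 < width:
                here = [z[ch][r][c] for ch in range(channels)]
                there = [z[ch][r2][c2] for ch in range(channels)]
                values.append(pearson(here, there))
    return sum(values) / len(values) if values else 0.0
```

To obtain a map like the one the caption describes, one would evaluate `latent_correlation` for each offset $(i,j)$ in a small window and average the result over all images in the dataset. Values near zero away from the center offset would indicate that little spatial redundancy remains in the normalized latent.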