MambaVC: Learned Visual Compression with Selective State Spaces

Shiyu Qin; Jinpeng Wang; Yimin Zhou; Bin Chen; Tianci Luo; Baoyi An; Tao Dai; Shutao Xia; Yaowei Wang

TL;DR

MambaVC is the first visual compression network built on selective state-space modeling. It uses Visual State Space (VSS) blocks with a 2D Selective Scan (2DSS) to capture global context efficiently. It achieves better rate-distortion performance with lower computation and memory overhead than CNN- and Transformer-based counterparts, and the gains grow on high-resolution imagery. Extended to video with an SSF backbone, MambaVC-SSF delivers competitive results, although a gap to state-of-the-art video codecs remains. Overall, the work demonstrates the promise of state-space models for scalable, high-quality visual compression and provides a thorough comparative analysis of design choices. The approach offers a principled path toward efficient, high-fidelity compression in real-world applications such as HD medical imaging and satellite data transmission.

Abstract

Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSM. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.
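The abstract and the Figure 2 caption describe the 2DSS module as running selective scans over the feature map in 4 parallel patterns. The minimal sketch below illustrates only the scan-and-merge structure. It assumes the four patterns are row-major, column-major, and their reverses, which is a common choice in visual state-space models but is not confirmed by this page. A fixed-decay linear recurrence (`toy_scan`, a hypothetical stand-in) replaces the learned, input-dependent selective scan, and all function names are illustrative.

```python
from typing import List

Grid = List[List[float]]


def scan_orders(h: int, w: int) -> List[List[int]]:
    """Four orderings of the flat indices 0..h*w-1 (assumed scan patterns)."""
    row_major = [r * w + c for r in range(h) for c in range(w)]
    col_major = [r * w + c for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def toy_scan(seq: List[float], decay: float = 0.5) -> List[float]:
    """Stand-in recurrence h_t = decay * h_{t-1} + x_t.

    A real selective scan would make the recurrence parameters
    depend on the input; here the decay is fixed for clarity.
    """
    out, state = [], 0.0
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


def scan_2d(grid: Grid, decay: float = 0.5) -> Grid:
    """Scan the grid along four orderings and sum the results per position."""
    h, w = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]
    merged = [0.0] * (h * w)
    for order in scan_orders(h, w):
        ys = toy_scan([flat[i] for i in order], decay)
        # Scatter each scan output back to its original spatial position.
        for i, y in zip(order, ys):
            merged[i] += y
    return [merged[r * w:(r + 1) * w] for r in range(h)]


if __name__ == "__main__":
    # A single impulse in the top-left corner reaches every position.
    print(scan_2d([[1.0, 0.0], [0.0, 0.0]]))
```

Because every position is visited by scans arriving from several directions, each output value can depend on the whole grid, which is the sense in which such a module captures global context at a cost linear in the number of positions.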

Paper Structure (35 sections, 14 equations, 15 figures, 3 tables)


Introduction
Related Works
Method
Preliminaries: State-Space Models and Mamba
The proposed MambaVC
Overview
Visual State Space (VSS) Block
2D Selective Scan (2DSS)
Extension to Video Compression
Experiments
Experimental Setup
Datasets and Training Details
Baselines
Standard Image Compression
Comparison with the State-of-the-art Methods
...and 20 more sections

Figures (15)

Figure 1: (a) MambaVC achieves the best BD rate with the least computation and memory overhead. See the sections "Variant Visual Compression Performance" and "Computational complexity" for more details. (b) The improvements of MambaVC over other designs become more pronounced with increasing resolution.
Figure 2: (a) Overview of MambaVC. CAM is the channel-wise auto-regressive entropy model [liu2023learned]. FM is the factorized entropy model. $\text{Conv}(N, 2)\downarrow$ and $\text{Deconv}(N, 2)\uparrow$ denote a stride-2 downsampling convolution and a stride-2 upsampling (transposed) convolution with $N{\times}N$ filters, respectively. (b) A VSS block consists of several layers. Each layer includes a 2DSS module, which performs selective scans in 4 parallel patterns.
Figure 3: Performance evaluation (optimized by MSE) on the Kodak dataset [franzen1999kodak].
Figure 4: Video compression performance evaluation on benchmark datasets.
Figure 5: Latent correlation of $(\bm{y} - \bm{\mu}) / \bm{\sigma}$. All models are trained with $\lambda=0.0067$. The value at position $(i,j)$ represents the cross-correlation between spatial locations $(x,y)$ and $(x+i,y+j)$ along the channel dimension, averaged across all images in Kodak [franzen1999kodak].
...and 10 more figures
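
The latent correlation described in the Figure 5 caption can be illustrated with a small sketch. It assumes that "cross-correlation along the channel dimension" means the Pearson correlation between the channel vectors at two spatial locations, averaged over all valid locations of one image. The exact normalization used by the authors is not stated on this page, and the names `pearson` and `latent_correlation` are hypothetical.

```python
from math import sqrt
from typing import List

# Normalized latent (y - mu) / sigma, indexed as [channel][row][col].
Latent = List[List[List[float]]]


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if degenerate)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    denom = sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def latent_correlation(z: Latent, di: int, dj: int) -> float:
    """Average channel-wise correlation between (x, y) and (x+di, y+dj)."""
    channels, height, width = len(z), len(z[0]), len(z[0][0])
    values = []
    # Visit only locations whose shifted partner stays inside the latent.
    for x in range(max(0, -di), min(height, height - di)):
        for y in range(max(0, -dj), min(width, width - dj)):
            a = [z[c][x][y] for c in range(channels)]
            b = [z[c][x + di][y + dj] for c in range(channels)]
            values.append(pearson(a, b))
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Three channels, a 1x2 spatial grid; the two locations are proportional.
    z = [[[1.0, 2.0]], [[2.0, 4.0]], [[3.0, 6.0]]]
    print(latent_correlation(z, 0, 1))
```

Evaluating `latent_correlation` over a window of offsets and averaging the result across images would give a map of the kind the caption describes. Values near zero away from the center indicate that the transform has removed most spatial redundancy from the latents.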
