Rotation Equivariant Mamba for Vision Tasks

Zhongchen Zhao; Qi Xie; Keyu Huang; Lei Zhang; Deyu Meng; Zongben Xu

TL;DR

This paper introduces EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The results indicate that embedding rotation equivariance makes visual Mamba models more robust to image rotations and yields superior or competitive performance compared to non-equivariant baselines, with approximately 50% fewer parameters.

Abstract

Rotation equivariance is one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models do not account for rotational symmetry in their design. This omission makes them inherently sensitive to image rotations, which constrains their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. Its core components are a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks, including high-level image classification, mid-level semantic segmentation, and low-level image super-resolution, demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only bolsters the robustness of visual Mamba models against rotation transformations, but also improves overall performance and parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.

Paper Structure (18 sections, 3 theorems, 16 equations, 10 figures, 8 tables)

This paper contains 18 sections, 3 theorems, 16 equations, 10 figures, 8 tables.

Introduction
Related Work
Visual Mamba
Rotation Equivariant Neural Networks
Rotation Equivariant Visual Mamba
Preliminaries
Overall Architecture of EQ-VMamba
Rotation Equivariant Tokenization
Rotation Equivariant Visual State-Space Block
Implementation of Rotation Equivariant Visual Mamba
Experimental Results
Image Classification
Semantic Segmentation
Classic Image Super-Resolution
Lightweight Image Super-Resolution
...and 3 more sections

Key Result

Theorem 1

Let ${\bm X} \in \mathbb{R}^{H \times W \times C \times 4}$ be a group-structured feature map and ${\bm x} \in \mathbb{R}^{HW \times C \times 4}$ be a group-structured feature sequence. Let $\tau(\cdot)$ and $\tau^{inv}(\cdot)$ denote the EQ-cross-scan and EQ-cross-merge operators defined in eq:EQ-C

Figures (10)

Figure 1: Visualization of output feature maps for an input image and its rotated version, comparing standard VMamba with our proposed EQ-VMamba.
Figure 2: Robustness comparison between VMamba, Spectral VMamba, and EQ-VMamba on the rotated ImageNet-100 dataset.
Figure 3: Illustration of the three fundamental transformation operators utilized within our equivariant framework.
Figure 4: Illustration of the equivariant property in an EQ-CNN layer. A spatial rotation applied to the input image induces a joint transformation on the output feature map: a corresponding spatial rotation and a cyclic shifting across the rotation group dimension.
Figure 5: The overall architecture of the proposed end-to-end rotation equivariant visual Mamba (EQ-VMamba). The framework mainly consists of: (a) an EQ-Patch Embedding module that tokenizes the input image into the group-structured feature maps, and (b) a stack of EQ-VSS blocks for hierarchical feature extraction. Each EQ-VSS block integrates an EQ-cross-scan operation for image-to-sequence flattening, group Mamba blocks for sequence modeling, and an EQ-cross-merge operation for sequence-to-image restoration.
...and 5 more figures
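
The equivariance property described in the Figure 4 caption (a spatial rotation of the input induces a spatial rotation of the output plus a cyclic shift along the rotation group dimension) can be checked numerically on a toy example. The sketch below is not the paper's EQ-CNN or EQ-VMamba code; it assumes a generic four-fold (90-degree) lifting correlation in NumPy, and the helper names `correlate_valid` and `c4_lift` are hypothetical, introduced only for illustration.

```python
import numpy as np


def correlate_valid(x, w):
    """Plain 'valid' cross-correlation of a 2-D image with a square 2-D filter."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out


def c4_lift(x, w):
    """Correlate x with the four 90-degree rotations of w.

    The four responses are stacked along a trailing 'group' axis,
    giving an output of shape (H', W', 4).
    """
    return np.stack(
        [correlate_valid(x, np.rot90(w, g)) for g in range(4)], axis=-1
    )


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))   # toy single-channel image (assumption)
kernel = rng.standard_normal((3, 3))  # toy filter (assumption)

features = c4_lift(image, kernel)
features_of_rotated = c4_lift(np.rot90(image), kernel)

# Expected joint transformation: rotate the feature map spatially,
# then cyclically shift it by one step along the group axis.
expected = np.roll(np.rot90(features, axes=(0, 1)), shift=1, axis=-1)

equivariance_error = np.max(np.abs(features_of_rotated - expected))
print("max equivariance error:", equivariance_error)
```

In this toy setting the error is at the level of floating-point round-off, because 90-degree rotations map the pixel grid onto itself exactly. The example only illustrates the joint "rotate plus cyclic shift" behaviour for a single lifting layer; it says nothing about how the paper's cross-scan or group Mamba blocks are implemented.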

Theorems & Definitions (3)

Theorem 1
Theorem 2
Corollary 1
