Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

Yi-Kuan Hsieh, Jun-Wei Hsieh, Xin Li, Ming-Ching Chang, Yu-Chee Tseng

TL;DR

The paper analyzes how scan order affects Vision State Space Models (VSSMs) and introduces Partial Ring Scan Mamba (PRISMamba), a rotation-robust ring-based traversal that maintains linear-time complexity $O(HW)$ with partial channel filtering. By aggregating features ring-by-ring and propagating context radially through short SSMs, PRISMamba preserves spatial locality under rotation while minimizing global path fragility. Empirical results on ImageNet-1K and COCO show PRISMamba achieving state-of-the-art accuracy–efficiency among Vision SSMs and strong rotation robustness, outperforming VMamba with lower FLOPs and higher throughput. Limitations include fixed image centers and discrete ring widths, with future work exploring learnable ring origins, content-adaptive rings, and extensions to spatio-temporal data.

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1–2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
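
The partial channel filtering idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how "most informative" is measured, so the mean-absolute-activation score, the `ratio` parameter, and the `ring_path` callable below are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def partial_channel_route(x, ratio, ring_path):
    """Route a subset of channels through a heavier pathway.

    x         : feature map of shape (C, H, W)
    ratio     : fraction of channels sent to the heavy pathway (assumed knob)
    ring_path : callable standing in for the recurrent ring pathway (assumed)
    """
    c = x.shape[0]
    k = max(1, int(round(c * ratio)))
    # Hypothetical informativeness score: mean absolute activation per channel.
    scores = np.abs(x).mean(axis=(1, 2))
    selected = np.sort(np.argsort(-scores)[:k])
    out = x.copy()  # unselected channels stay on the identity (residual) branch
    out[selected] = x[selected] + ring_path(x[selected])
    return out, selected

# Four constant channels; the two with the largest magnitude are selected.
x = np.stack([np.full((2, 2), v) for v in (0.1, 3.0, -2.0, 0.5)])
out, selected = partial_channel_route(
    x, ratio=0.5, ring_path=lambda t: np.ones_like(t)
)
```

With `ratio=0.5`, only half of the channels pay the cost of `ring_path`; the remaining channels are copied through unchanged, which is where the efficiency gain in this sketch comes from.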

Paper Structure (23 sections, 10 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Scanning order affects Vision-Mamba performance. (a) Fixed-path scans (e.g., raster or serpentine in VMamba [liu2024vmamba; zhu2024visionmamba], Zigma [hu2024zigma], MaIR [li2025mair], LocalMamba [huang2024localmamba]) preserve sequence–space alignment only under flips. An in-plane rotation (here $60^\circ$) causes padding and global reindexing, fracturing the path so the recurrent kernel moves along misaligned neighborhoods. (b) Our Ring Scan treats serialization as order-agnostic aggregation within concentric rings, followed by radial composition from inner to outer, producing a rotation-stable sequence without polar remapping or rotation-specific training.
  • Figure 2: Architecture with Partial RIng Scan Mamba (PRISMamba). The image is patchified and processed by a four-stage backbone; stage $i$ stacks $L_i$ PRISM blocks (Partial RIng Scan Mamba) with $C_i$ output channels, and stages are separated by downsampling. Each PRISM block performs order-agnostic aggregation over a subset of concentric rings (partial ring scan), composes information radially with a short sequence operator, and writes features back via a $1{\times}1$ projection before residual fusion. Channel filtering routes only the most informative channels through this pathway while keeping the rest on a lightweight residual branch, further improving efficiency.
  • Figure 3: Ring Scan. Pixels are partitioned into concentric rings, which are iteratively traversed in a clockwise or counterclockwise sequence. The resulting features are then aggregated in an order-independent fashion, proceeding from inner to outer rings.
  • Figure 4: Primitive scan orders. Twelve canonical paths (S1–S12) such as left-to-right raster, serpentine, and diagonal produce distinct 1D sequences from the same image. S1–S12 evaluate single scans; S13–S18 evaluate pairs of scans; S19–S21 aggregate four scans, enabling a systematic comparison of scan-order effects.
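
The contrast drawn in the captions above, between fixed-path scans and ring-based aggregation, can be sketched in a few lines of NumPy. The sketch makes several assumptions that the captions do not confirm: rings are square (Chebyshev distance from the image center) with width one pixel, the order-agnostic aggregation is a mean, and the radial composition is a toy scalar recurrence standing in for a short SSM. It only checks an exact 90° rotation, where the pixel grid maps onto itself; it does not model the $60^\circ$ case in Figure 1.

```python
import numpy as np

def ring_index(h, w):
    """Ring id of each pixel: floored Chebyshev distance to the image center."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.floor(np.maximum(np.abs(ys - cy), np.abs(xs - cx))).astype(int)

def ring_scan(img):
    """Mean-pool each ring (order-agnostic), ordered from inner to outer."""
    rid = ring_index(*img.shape)
    return np.array([img[rid == r].mean() for r in range(rid.max() + 1)])

def radial_recurrence(seq, a=0.5, b=1.0):
    """Toy recurrence h_k = a*h_{k-1} + b*x_k, a stand-in for a short SSM."""
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x
        out.append(h)
    return np.array(out)

def raster_scan(img):
    """Left-to-right, top-to-bottom serialization."""
    return img.reshape(-1)

def serpentine_scan(img):
    """Raster scan with every second row reversed."""
    out = img.copy()
    out[1::2] = img[1::2, ::-1]
    return out.reshape(-1)

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 6))
rot = np.rot90(img)  # exact 90-degree in-plane rotation

# Ring tokens are unchanged by the rotation; the raster sequence is not.
ring_same = np.allclose(ring_scan(img), ring_scan(rot))
raster_same = np.array_equal(raster_scan(img), raster_scan(rot))
context = radial_recurrence(ring_scan(img))  # inner-to-outer propagation
```

Because each ring is reduced by a permutation-invariant operation, the sequence fed to the radial recurrence has one token per ring rather than one per pixel, which is why the recurrence in this sketch is short.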