EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

TL;DR

EfficientViM tackles the inefficiency of global token mixers by introducing Hidden State Mixer-based SSD (HSM-SSD), which moves channel mixing and gating into compressed hidden states. This reduces the cost of the dominant linear projections from $\mathcal{O}(LD^2)$ to $\mathcal{O}(ND^2)$ and $\mathcal{O}(LND)$, where $L$ is the number of tokens, $N$ the number of hidden states, and $D$ the channel dimension. It further strengthens the representation power of the hidden states with multi-stage hidden state fusion (MSF) and adopts a hardware-friendly, single-head design to minimize memory-bound bottlenecks, achieving a superior speed-accuracy trade-off on ImageNet-1K and strong performance in dense prediction tasks. Across ImageNet, COCO, and ADE20K, EfficientViM variants outperform prior vision Mambas and SHViT, with EfficientViM-M2–M4 delivering notable throughput gains and competitive or improved accuracy, especially at high resolutions. The work provides extensive ablations, a high-resolution scalability study, and CPU/mobile latency analysis, showing practical benefits for on-device deployment, with code released for reproducibility.
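To make the complexity claim concrete, the following rough sketch counts multiply-accumulate operations (MACs) for a single $D \times D$ projection applied to $L$ tokens, and compares it with the same projection applied to $N$ hidden states plus the $\mathcal{O}(LND)$ cost of compressing tokens into states and reading them back out. The values `L = 196`, `N = 16`, `D = 256` and all function names are illustrative assumptions, not settings or code from the paper.

```python
def token_projection_macs(L, D):
    """MACs for one D x D linear projection applied to all L tokens: O(L D^2)."""
    return L * D * D


def hidden_projection_macs(N, D):
    """MACs for the same projection applied to N hidden states: O(N D^2)."""
    return N * D * D


def state_io_macs(L, N, D):
    """MACs to compress L tokens into N states and expand back: O(L N D)."""
    return 2 * L * N * D


# Illustrative sizes only (assumed, not taken from the paper).
L, N, D = 196, 16, 256

token_cost = token_projection_macs(L, D)
hidden_cost = hidden_projection_macs(N, D) + state_io_macs(L, N, D)

print("projection on tokens       :", token_cost)
print("projection on hidden states:", hidden_cost)
print("ratio                      :", token_cost / hidden_cost)
```

The count ignores constants, memory traffic, and every other operation in the layer, so it only indicates why the saving grows as $N$ becomes small relative to $L$.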

Abstract

For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective operation for global interaction with its favorable linear computational cost in the number of tokens. To harness the efficacy of SSM, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. With the observation that the runtime of the SSD layer is driven by the linear projections on the input sequences, we redesign the original SSD layer to perform the channel mixing operation within compressed hidden states in the HSM-SSD layer. Additionally, we propose multi-stage hidden state fusion to reinforce the representation power of hidden states and provide the design to alleviate the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
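The data flow described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hsm_ssd_sketch`, the sigmoid gate, the weight names `W_gate` and `W_out`, and the random inputs are all hypothetical, and in the actual model $\mathbf{a}$, $\mathbf{B}$, and $\mathbf{C}$ would be produced from the input rather than sampled.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def hsm_ssd_sketch(x, a, B, C, W_gate, W_out):
    """Toy HSM-SSD-style layer (assumed form).

    x: (L, D) tokens, a: (L,) per-token weights,
    B, C: (L, N) state input/output maps, W_gate, W_out: (D, D).
    """
    # Compress L tokens into N hidden states: O(L N D).
    h = (a[:, None] * B).T @ x            # (N, D)
    # Channel mixing and gating on the hidden states only: O(N D^2).
    gate = sigmoid(h @ W_gate)            # (N, D)
    h = (h * gate) @ W_out                # (N, D)
    # Read the mixed states back out to every token: O(L N D).
    return C @ h                          # (L, D)


rng = np.random.default_rng(0)
L, N, D = 196, 16, 64                     # illustrative sizes (assumed)
x = rng.standard_normal((L, D))
a = rng.random(L)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
W_gate = rng.standard_normal((D, D)) / np.sqrt(D)
W_out = rng.standard_normal((D, D)) / np.sqrt(D)

y = hsm_ssd_sketch(x, a, B, C, W_gate, W_out)
print(y.shape)
```

The point of the sketch is that both $D \times D$ matrix products act on an $N \times D$ array, so no projection of that size touches the full $L \times D$ token sequence.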

Paper Structure

This paper contains 23 sections, 2 theorems, 11 equations, 8 figures, 14 tables, 1 algorithm.

Key Result

Proposition 1

Let $N=L$, $\mathbf{a} \mathbbm{1}^\top_L \odot \mathbf{B} = \mathbbm{I}_L$, and $\mathbf{C} \in \mathbb{R}^{L \times L}$ be diagonal. Then, $\text{HSM-SSD}(\mathbf{x}, \mathbf{a}, \mathbf{B}, \mathbf{C})$ is equivalent to $\text{NC-SSD}(\mathbf{x}, \mathbf{a}, \mathbf{B}, \mathbf{C})$ including gat
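The conditions of the proposition can be checked numerically on a toy example. The sketch below is a sanity check under assumptions, not the paper's proof: it assumes that NC-SSD is followed by a sigmoid gate computed from the input tokens and a linear output projection, and that HSM-SSD applies the same gate and projection to the hidden states before multiplying by $\mathbf{C}$. The function names and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 8, 4
N = L                                     # condition: N = L

x = rng.standard_normal((L, D))
a = rng.random(L) + 0.5                   # strictly positive weights
B = np.diag(1.0 / a)                      # makes (a 1^T ⊙ B) the identity
C = np.diag(rng.standard_normal(L))       # condition: C is diagonal
W_gate = rng.standard_normal((D, D))      # assumed gate projection
W_out = rng.standard_normal((D, D))       # assumed output projection


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def nc_ssd_with_gate(x, a, B, C):
    """Assumed NC-SSD path: mix tokens, then gate and project per token."""
    h = (a[:, None] * B).T @ x            # (N, D)
    y = C @ h                             # (L, D)
    return (y * sigmoid(x @ W_gate)) @ W_out


def hsm_ssd(x, a, B, C):
    """Assumed HSM-SSD path: gate and project the hidden states, then read out."""
    h = (a[:, None] * B).T @ x            # (N, D), equals x under the conditions
    h = (h * sigmoid(h @ W_gate)) @ W_out
    return C @ h                          # (L, D)


equivalent = np.allclose(nc_ssd_with_gate(x, a, B, C), hsm_ssd(x, a, B, C))
print(equivalent)
```

Under these assumptions the two paths agree because the hidden states coincide with the tokens and a diagonal $\mathbf{C}$ only rescales rows, which commutes with both the elementwise gate and the right-multiplication by the output projection.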

Figures (8)

  • Figure 1: Comparison of efficient networks on ImageNet-1K (Deng et al., 2009) classification. The family of our EfficientViM, marked as red and blue stars, shows the best speed-accuracy trade-offs. ✝ indicates the model trained with distillation following Touvron et al. (2021).
  • Figure 2: Illustration of (left) NC-SSD and (right) HSM-SSD layer. In the HSM-SSD layer, the computationally heavy projections are handled with the reduced hidden state in HSM as highlighted. Red, blue, and orange colors indicate operations with complexities of $\mathcal{O}(LD^2)$, $\mathcal{O}(LND)$, and $\mathcal{O}(ND^2)$, respectively.
  • Figure 3: Runtime breakdown of HSM-SSD with EfficientViM-M2. The operations highlighted in red are memory-bound.
  • Figure 4: (left) Overall architecture and (right) block design of EfficientViM. The dotted line indicates a skip connection for multi-stage hidden state fusion (MSF). Illustration of the HSM-SSD layer in the EfficientViM block is presented in Figure 2.
  • Figure A: Latency comparison of recent efficient networks for an extremely high-resolution image.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Remark 1
  • Proposition 1
  • proof