EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
TL;DR
EfficientViM tackles the inefficiency of global token mixers by introducing Hidden State Mixer-based SSD (HSM-SSD), which moves channel mixing and gating into compressed hidden states. For $L$ tokens, $N$ hidden states, and $D$ channels, this reduces the cost of the dominant linear projections from $\mathcal{O}(LD^2)$ to $\mathcal{O}(ND^2)$, with the remaining token-side cost at $\mathcal{O}(LND)$. It further enhances representation power with multi-stage hidden state fusion (MSF) and adopts a hardware-friendly, single-head design to minimize memory-bound bottlenecks, achieving a superior speed-accuracy trade-off on ImageNet-1K and strong performance in dense prediction tasks. Across ImageNet, COCO, and ADE20K, EfficientViM variants outperform prior vision Mambas and SHViT, with EfficientViM-M2–M4 delivering notable throughput gains and competitive or improved accuracy, especially at high resolutions. The work also provides extensive ablations, a high-resolution scalability study, and CPU/mobile latency analysis, showing practical benefits for on-device deployment, and the code is released for reproducibility.
Abstract
For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective operation for global interaction, with a favorable computational cost that is linear in the number of tokens. To harness the efficacy of SSM, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies at a further reduced computational cost. Observing that the runtime of the SSD layer is dominated by the linear projections on the input sequences, we redesign the original SSD layer so that the HSM-SSD layer performs channel mixing within compressed hidden states. Additionally, we propose multi-stage hidden state fusion to reinforce the representation power of hidden states, and we present a design that alleviates the bottleneck caused by memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1K, offering up to a 0.7% performance improvement over the second-best model, SHViT, at faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works when scaling up image resolution or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
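To make the complexity argument concrete, the following is a rough cost sketch, not the authors' implementation. It counts multiply-adds for a dense projection applied to all $L$ tokens versus the same projection applied to $N$ compressed hidden states. The function names and the sizes $L=196$, $N=16$, $D=256$ are illustrative assumptions. Counting two $\mathcal{O}(LND)$ token-to-state products (one to compress, one to expand) is likewise an assumption of this sketch.

```python
def projection_cost(num_rows: int, dim: int) -> int:
    """Multiply-adds for a dense (num_rows x dim) @ (dim x dim) projection."""
    return num_rows * dim * dim


def state_transfer_cost(num_tokens: int, num_states: int, dim: int) -> int:
    """Multiply-adds for one product between an (L x N) map and (L x D) or (N x D) features."""
    return num_tokens * num_states * dim


def compare_costs(num_tokens: int, num_states: int, dim: int) -> dict:
    """Return the individual cost terms for token-side and hidden-state-side mixing."""
    # Projection applied to every token: O(L D^2).
    token_projection = projection_cost(num_tokens, dim)
    # Same projection applied to the compressed hidden states: O(N D^2).
    hidden_projection = projection_cost(num_states, dim)
    # Assumption of this sketch: one O(L N D) product to compress tokens into
    # hidden states and one to expand them back to tokens.
    transfers = 2 * state_transfer_cost(num_tokens, num_states, dim)
    return {
        "token_projection": token_projection,
        "hidden_projection": hidden_projection,
        "transfers": transfers,
        "hidden_total": hidden_projection + transfers,
    }


# Hypothetical sizes: a 14 x 14 token grid, 16 hidden states, 256 channels.
costs = compare_costs(num_tokens=196, num_states=16, dim=256)
for name, value in costs.items():
    print(f"{name}: {value:,}")
```

Under these assumed sizes, the hidden-state-side total is several times smaller than the token-side projection, and the gap widens as $L$ grows with image resolution while $N$ stays fixed.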
