TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Xiaowen Ma, Zhenliang Ni, Xinghao Chen
TL;DR
TinyViM addresses the inefficiency of lightweight vision Mamba backbones by decoupling features by frequency with a Laplace mixer and applying a frequency ramp inception within a Convolution-Mamba hybrid. The method feeds only low-frequency components into the Mamba block, preserves high-frequency details with depthwise convolutions, and gradually shifts the frequency emphasis across stages to balance accuracy and efficiency. Empirical results on ImageNet, COCO, and ADE20K show TinyViM outperforming Convolution, Transformer, and prior Mamba-based models of similar scale, with roughly 2-3 times the throughput of other Mamba-based models. The work offers a practical, hardware-friendly backbone for real-time vision systems and suggests extending TinyViM as a lightweight encoder for multi-modal tasks such as SAM.
Abstract
Mamba has shown great potential for computer vision because it models global context with complexity linear in the input length. However, existing lightweight Mamba-based backbones do not match the performance of Convolution or Transformer-based methods. We observe that simply modifying the scanning path in the image domain does not fully exploit the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses and verify that the Mamba block mainly models low-frequency information in a Convolution-Mamba hybrid architecture. Based on these analyses, we introduce a novel Laplace mixer that decouples the features by frequency and inputs only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., we gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and the efficient Laplace mixer, we build a series of tiny hybrid vision Mamba models called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks, including image classification, semantic segmentation, object detection, and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer, and Mamba-based models of similar scale, and its throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at https://github.com/xwmaxwma/TinyViM.
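
The two ideas in the abstract, frequency decoupling and a stage-wise ramp of the high-frequency branch width, can be illustrated with a minimal pure-Python sketch. This is not the TinyViM implementation: it assumes a Laplacian-pyramid-style split in which the low-frequency part is a block average and the high-frequency part is the residual, and the names `split_frequencies` and `ramp_channel_split`, the pooling size, and the start and end ratios are all hypothetical choices made for illustration.

```python
def split_frequencies(image, pool=2):
    """Split a 2D grid into low- and high-frequency parts.

    Assumption: low = block average broadcast back to full size,
    high = input minus low (a Laplacian-pyramid-style residual).
    """
    h, w = len(image), len(image[0])
    if h % pool or w % pool:
        raise ValueError("height and width must be divisible by pool")
    low = [[0.0] * w for _ in range(h)]
    for bi in range(0, h, pool):
        for bj in range(0, w, pool):
            # Average over one pool x pool block.
            block = [image[i][j]
                     for i in range(bi, bi + pool)
                     for j in range(bj, bj + pool)]
            mean = sum(block) / len(block)
            for i in range(bi, bi + pool):
                for j in range(bj, bj + pool):
                    low[i][j] = mean
    # The residual keeps the detail that averaging removed.
    high = [[image[i][j] - low[i][j] for j in range(w)] for i in range(h)]
    return low, high


def ramp_channel_split(channels, start_ratio=0.5, end_ratio=0.125):
    """Return (high, low) channel counts per stage.

    Assumption: the share of channels given to the high-frequency
    branch falls linearly from start_ratio to end_ratio across stages.
    """
    num_stages = len(channels)
    splits = []
    for s, c in enumerate(channels):
        t = s / (num_stages - 1) if num_stages > 1 else 0.0
        ratio = start_ratio + t * (end_ratio - start_ratio)
        high = int(round(c * ratio))
        splits.append((high, c - high))
    return splits


if __name__ == "__main__":
    low, high = split_frequencies([[1.0, 2.0], [3.0, 4.0]])
    print(low, high)
    print(ramp_channel_split([64, 128, 256, 512]))
```

In a hybrid block built along these lines, the `low` part would be the input to the Mamba block and the `high` part would go to the depthwise-convolution branch, while the per-stage split decides how many channels each branch receives.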
