TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Xiaowen Ma, Zhenliang Ni, Xinghao Chen

TL;DR

TinyViM addresses the inefficiency of lightweight vision Mamba backbones by decoupling feature frequencies via a Laplace mixer and employing frequency ramp inception within a Convolution-Mamba hybrid. The method inputs only low-frequency information into the Mamba block while preserving high-frequency details with depthwise convolutions, and gradually shifts frequency emphasis across stages to optimize accuracy and efficiency. Empirical results across ImageNet, COCO, and ADE20K show TinyViM achieving superior accuracy and throughput compared with Convolution, Transformer, and prior Mamba-based models, with particularly strong performance at similar scales. The work offers a practical, hardware-friendly backbone for real-time vision systems and suggests extending TinyViM as a lightweight encoder for multi-modal tasks such as SAM.

Abstract

Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot match the performance of Convolution or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under a Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and the efficient Laplace mixer, we build a series of tiny hybrid vision Mamba models called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer and Mamba-based models with similar scales, and the throughput is about 2-3 times higher than that of other Mamba-based models. Code is available at https://github.com/xwmaxwma/TinyViM.
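
The frequency decoupling described in the abstract can be illustrated with one level of a Laplacian pyramid. The sketch below is not the authors' implementation: it assumes 2x2 average pooling followed by nearest-neighbor upsampling as the low-pass filter, works on a single-channel feature map stored as nested lists, and uses hypothetical function names (`avg_pool2x2`, `upsample2x_nearest`, `laplace_decouple`). In the paper's design the low-frequency part would go to the Mamba block and the high-frequency residual to depthwise convolutions; neither of those is modeled here.

```python
def avg_pool2x2(x):
    """Average each non-overlapping 2x2 window (height and width assumed even)."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def upsample2x_nearest(x):
    """Repeat every value into a 2x2 block (nearest-neighbor upsampling)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def laplace_decouple(x):
    """Split x into a smooth low-frequency part and a high-frequency residual.

    Assumption: low = upsample(avg_pool(x)), high = x - low, so that
    low + high reconstructs x exactly.
    """
    low = upsample2x_nearest(avg_pool2x2(x))
    high = [[a - b for a, b in zip(row_x, row_l)] for row_x, row_l in zip(x, low)]
    return low, high


# Toy 4x4 feature map (illustrative values only).
feature = [
    [1.0, 3.0, 2.0, 2.0],
    [5.0, 7.0, 2.0, 2.0],
    [0.0, 0.0, 4.0, 8.0],
    [0.0, 0.0, 4.0, 8.0],
]
low, high = laplace_decouple(feature)
# low[0]  -> [4.0, 4.0, 2.0, 2.0]   (block averages, piecewise constant)
# high[0] -> [-3.0, -1.0, 0.0, 0.0] (what the low-pass filter removed)
```

Because the residual is defined as `x - low`, the two parts always sum back to the input, so no information is lost by the split. The low-frequency part also lives naturally at half resolution, which suggests why feeding only that part to the global module could reduce its cost.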

Paper Structure

This paper contains 21 sections, 9 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of GFLOPs, throughput and accuracy between TinyViM and other models. The top-1 accuracy is tested on ImageNet-1K and the throughput is measured on an NVIDIA V100 GPU with the maximum power-of-two batch size that fits in memory. Compared to recent Mamba-based models such as EfficientVMamba, QuadMamba, and VMamba, TinyViM has higher top-1 accuracy with $2\times$ higher throughput and fewer GFLOPs.
  • Figure 2: Spectral analysis of Mamba. (a) Frequency magnitude ($14 \times 14$) from 8 output channels before and after the Mamba block. (b) Relative log amplitudes of Fourier-transformed feature maps. The magnitude and amplitude are averaged over 384 samples. (a) and (b) show that the Mamba block focuses on capturing low-frequency information under the Convolution-Mamba hybrid architecture.
  • Figure 3: Overview of the proposed TinyViM, which has four stages, each consisting of Local Blocks and TinyViM Blocks. The Local Block applies a reparameterized $3 \times 3$ convolution to extract local features, and the TinyViM Block is employed to capture the global context. The core component of the TinyViM Block is the Laplace Mixer, which decouples the frequencies of the features based on an efficient Laplace pyramid and passes the state only for the low frequencies. The enhanced low- and high-frequency components are then integrated based on a frequency ramp Inception structure. Thus, the high- and low-frequency components of different stages are appropriately traded off and the efficiency is further improved. SS2D denotes 2D selective scanning, which is used in VMamba.
  • Figure 4: Fourier spectrum of TinyViM for the low-frequency branch and the high-frequency branch. The high- and low-frequency components of the features are decoupled and then enhanced separately.
  • Figure 5: Effective receptive fields (ERFs) of MobileOne, SwiftFormer and TinyViM. TinyViM effectively obtains large ERFs with the Laplace Mixer.
  • ...and 2 more figures
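
The kind of spectral measurement named in the Figure 2(b) caption can be sketched on a toy example. This is an illustration under stated assumptions, not the paper's analysis code: it assumes the relative log amplitude at a frequency is the log amplitude there minus the log amplitude at zero frequency, uses a naive 2D DFT from the standard library, and the helper names (`dft2`, `relative_log_amplitude`) and the two 4x4 maps are hypothetical.

```python
import cmath
import math


def dft2(x):
    """Naive 2D discrete Fourier transform of a small real-valued map."""
    h, w = len(x), len(x[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for i in range(h):
                for j in range(w):
                    angle = -2.0 * math.pi * (u * i / h + v * j / w)
                    total += x[i][j] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def relative_log_amplitude(x, eps=1e-8):
    """Log amplitude at every frequency minus the log amplitude at frequency (0, 0).

    Assumption: this definition of "relative" is chosen for illustration.
    eps avoids log(0) for frequencies with vanishing amplitude.
    """
    spectrum = dft2(x)
    dc = math.log(abs(spectrum[0][0]) + eps)
    return [[math.log(abs(c) + eps) - dc for c in row] for row in spectrum]


# A map with fine horizontal detail, and a piecewise-constant (smoothed) map
# with the same 2x2 block averages. Values are illustrative only.
detailed = [
    [1.0, 3.0, 2.0, 2.0],
    [5.0, 7.0, 2.0, 2.0],
    [0.0, 0.0, 4.0, 8.0],
    [0.0, 0.0, 4.0, 8.0],
]
smooth = [
    [4.0, 4.0, 2.0, 2.0],
    [4.0, 4.0, 2.0, 2.0],
    [0.0, 0.0, 6.0, 6.0],
    [0.0, 0.0, 6.0, 6.0],
]

rel_detailed = relative_log_amplitude(detailed)
rel_smooth = relative_log_amplitude(smooth)

# Index [0][2] is the highest horizontal frequency of a width-4 map.
# The smoothed map has far less energy there than the detailed one.
print(rel_detailed[0][2], rel_smooth[0][2])
```

A module that mainly models low-frequency information would be expected to behave like the smoothed map here: its output keeps the zero-frequency content while the relative log amplitude at high frequencies drops sharply.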