OuroMamba: A Data-Free Quantization Framework for Vision Mamba

Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna

TL;DR

OuroMamba tackles data-free post-training quantization of vision Mamba-based models (VMMs) by addressing two core issues: limited long-range interactions in recurrent S6 blocks and dynamic activation outliers across time steps. It introduces a two-stage framework: OuroMamba-Gen enhances implicit attention with patched neighborhood interactions to generate semantically meaningful synthetic data, and OuroMamba-Quant employs mixed-precision quantization with online outlier detection to minimize quantization error. Across classification, detection, segmentation, and diffusion tasks, OuroMamba achieves state-of-the-art results with data-free calibration using only 128 synthetic samples, and delivers practical latency improvements with a dedicated GEMM kernel. This work enables privacy-preserving, efficient deployment of VMMs, extending data-free quantization to state-of-the-art vision models.

Abstract

We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions, leading to semantically weak synthetic data; and (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features generated through neighborhood interactions in the latent state space; and (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a threshold-based outlier channel selection strategy for activations that is updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels that achieve a practical latency speedup of up to 2.36x. Code and the synthetic dataset are available at https://github.com/georgia-tech-synergy-lab/ICCV-OuroMamba
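
To make the idea of threshold-based outlier channel selection concrete, the sketch below shows one way such a per-time-step scheme could look in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fake_quant`, `quantize_time_step`), the fixed `threshold`, the 4-bit/8-bit split, and the choice of a shared scale for inlier channels versus per-channel scales for outlier channels are all hypothetical.

```python
import numpy as np

def fake_quant(x, n_bits, axis=None):
    """Symmetric uniform quantize-dequantize.

    axis=None uses one scale for the whole array; axis=0 uses one scale
    per column (channel).
    """
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Guard against all-zero inputs so the scale is never zero.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_time_step(x_t, threshold, low_bits=4, high_bits=8):
    """Quantize the activations of one time-step, shape (rows, channels).

    Channels whose peak magnitude exceeds `threshold` are treated as
    outliers for this time-step only (assumed selection rule).
    """
    peak = np.max(np.abs(x_t), axis=0)
    outlier = peak > threshold
    out = np.zeros_like(x_t, dtype=np.float64)
    if (~outlier).any():
        # Inlier channels: low precision, one shared scale.
        out[:, ~outlier] = fake_quant(x_t[:, ~outlier], low_bits)
    if outlier.any():
        # Outlier channels: higher precision, one scale per channel.
        out[:, outlier] = fake_quant(x_t[:, outlier], high_bits, axis=0)
    return out, outlier

# The outlier set is recomputed at every time-step, so it can move
# from one channel to another as the sequence is processed.
rng = np.random.default_rng(0)
steps = rng.normal(size=(3, 4, 6))   # (time-steps, rows, channels)
steps[0, :, 2] *= 50.0               # outlier in channel 2 at t=0
steps[1, :, 5] *= 50.0               # outlier in channel 5 at t=1
for t, x_t in enumerate(steps):
    _, mask = quantize_time_step(x_t, threshold=10.0)
    print(t, np.flatnonzero(mask))
```

Because the mask is derived from the current activations rather than from a calibration pass, a static assignment of outlier channels is never needed; the cost is one per-channel reduction at each time-step.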

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Qualitative comparison of VMM implicit attention. (a) Sample input image. (b) Normalized distribution of the input gate values ($\Delta$). Visualization of implicit attention [ali2024hidden] using (c) the original state ($h$) and (d) the proposed patched state ($h_p$), which incorporates spatial dependencies through patched neighborhood interactions.
  • Figure 2: W4A8 quantization performance comparison under the impact of different calibration data sources.
  • Figure 3: (a) Forward and backward SSM state transitions of the first layer of Vim-S [zhu2024vision]. (b) Naive synthetic data samples generated by applying [ramachandran2024clamp] to VMM implicit attention.
  • Figure 4: Dynamic inter-time-step outlier channel variations for two representative S6 layer activations: $\bar{A}, \bar{B}$ in layer 3 of Vim-T.
  • Figure 5: Synthetic data samples generated by OuroMamba.
  • ...and 4 more figures