StableMamba: Distillation-free Scaling of Large SSMs for Images and Videos

Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall

TL;DR

StableMamba is a distillation-free architecture that interleaves Mamba blocks with attention-based Transformer blocks to address the scaling and robustness limitations of prior vision SSMs. Placing Transformer blocks among bi-directional Mamba layers, together with RMS normalization and internal MLPs, stabilizes training and improves resilience to common corruptions, which lets the models scale beyond tens of millions of parameters. On ImageNet-1K, Kinetics-400, and Something-Something-v2, StableMamba matches or exceeds the performance of VideoMamba without using distillation, shows competitive robustness on ImageNet-C, and improves accuracy over prior Mamba baselines by up to $+1.7$ points. The work offers a practical path toward high-capacity, robust vision models that combine the strengths of state-space models and attention.
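To make the idea of interleaving concrete, the sketch below builds a block-type schedule for a stack of layers. It is only an illustration: the function name `interleaved_schedule`, the parameter `mamba_per_attention`, and the choice of placing one attention block after each group of Mamba blocks are assumptions, not the layout reported in the paper (which studies the position and ratio of Transformer blocks separately).

```python
def interleaved_schedule(depth: int, mamba_per_attention: int) -> list[str]:
    """Return the block type for each of `depth` layers.

    Hypothetical layout (an assumption, not the paper's configuration):
    every group of `mamba_per_attention` Mamba blocks is followed by one
    attention block.
    """
    if depth <= 0 or mamba_per_attention <= 0:
        raise ValueError("depth and mamba_per_attention must be positive")
    period = mamba_per_attention + 1
    # The last slot of each period is an attention block; the rest are Mamba.
    return [
        "attention" if (i + 1) % period == 0 else "mamba"
        for i in range(depth)
    ]


schedule = interleaved_schedule(depth=8, mamba_per_attention=3)
print(schedule)
```

Changing `mamba_per_attention` varies the ratio of the two block types, and shifting where the attention slot falls within each period varies its position; both are the kinds of design choices the paper's ablations examine.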

Abstract

State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
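The difference between data-independent and data-dependent state-space parameters can be illustrated with a toy scalar recurrence. This is a simplified sketch, not S4 or S6 as implemented: the state is a single number, only the step size depends on the input (Mamba also makes other parameters input-dependent), and the names `fixed_scan`, `selective_scan`, and the weight `w` are hypothetical.

```python
import math


def fixed_scan(xs, a=0.9, b=0.1, c=1.0):
    """Data-independent recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    The coefficients a, b, c are the same at every step, whatever the input.
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def selective_scan(xs, A=-1.0, B=1.0, c=1.0, w=1.0):
    """Data-dependent recurrence: the step size is a function of the input.

    delta_t = softplus(w * x_t) controls how much the state decays and how
    strongly the current input is written into it (toy assumption).
    """
    h, ys = 0.0, []
    for x in xs:
        delta = math.log1p(math.exp(w * x))  # softplus, always positive
        a_t = math.exp(delta * A)            # input-dependent decay in (0, 1)
        b_t = delta * B                      # input-dependent write strength
        h = a_t * h + b_t * x
        ys.append(c * h)
    return ys


print(fixed_scan([1.0, 0.0, 0.0]))
print(selective_scan([1.0, 0.0, 0.0]))
```

In `fixed_scan` the state always decays by the same factor, so the model cannot choose what to retain based on content. In `selective_scan` the decay applied at each step depends on the token being read, which is the property the abstract refers to as data-dependent.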

Paper Structure (12 sections, 6 equations, 6 figures, 6 tables)

This paper contains 12 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Performance comparison with VideoMamba: We compare the performance of our model with VideoMamba (Li et al., 2024), both with and without distillation, on IN1K (Deng et al., 2009).
  • Figure 2: (a) Performance comparison of different networks on Gaussian blur corruption. (b) Performance comparison of different networks on JPEG compression corruption.
  • Figure 3: Loss curves obtained from training VideoMamba with and without distillation.
  • Figure 4: (a) The overall architecture of the StableMamba model. (b) Anatomy of Transformer block. (c) Anatomy of Mamba block. (d) Anatomy of bidirectional Mamba layer.
  • Figure 5: (a) Impact of the position of the Transformer block within StableMamba. (b) Impact of the ratio of Transformer blocks to Mamba blocks.
  • ...and 1 more figures