Autoregressive Pretraining with Mamba in Vision

Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie

TL;DR

ARM introduces autoregressive pretraining for Vision Mamba architectures by using cluster-based prediction units and row-wise forward ordering, implemented via MambaMLP blocks. The approach delivers strong accuracy gains, enabling large and huge model scaling with improved training efficiency compared to prior pretraining strategies. Empirical results on ImageNet-1K show base-size ARM surpassing supervised baselines and achieving state-of-the-art Mamba vision performance, with ARM-H reaching 85.0% (85.5% with higher input resolution) and robust improvements on out-of-domain datasets. Comprehensive ablations identify optimal cluster size, decoder configuration, and targets, while demonstrating ARM’s superiority over MAE and contrastive pretraining for Mamba in vision.
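To make "cluster-based prediction units" and "row-wise forward ordering" concrete, the following is a minimal sketch, not the authors' implementation. It assumes an image has already been split into a grid of patches, groups neighbouring patches into square clusters, visits the clusters row by row, and pairs each prefix of clusters with the next cluster to predict. The function names `rowwise_clusters` and `next_cluster_pairs`, the 4x4 grid, and the cluster size of 2 are all illustrative assumptions rather than values taken from the paper.

```python
from typing import List, Tuple


def rowwise_clusters(grid_h: int, grid_w: int, cluster: int) -> List[List[int]]:
    """Group a grid_h x grid_w patch grid into cluster x cluster blocks.

    Clusters are returned in row-first (raster) order. Each cluster lists the
    flat indices (row * grid_w + col) of the patches it contains.
    Hypothetical helper for illustration only.
    """
    if grid_h % cluster or grid_w % cluster:
        raise ValueError("grid size must be divisible by the cluster size")
    clusters = []
    for top in range(0, grid_h, cluster):        # cluster rows, top to bottom
        for left in range(0, grid_w, cluster):   # cluster columns, left to right
            clusters.append([
                r * grid_w + c
                for r in range(top, top + cluster)
                for c in range(left, left + cluster)
            ])
    return clusters


def next_cluster_pairs(
    clusters: List[List[int]],
) -> List[Tuple[List[List[int]], List[int]]]:
    """Pair every prefix of the cluster sequence with the cluster that follows it."""
    return [(clusters[:i], clusters[i]) for i in range(1, len(clusters))]


# Illustrative sizes (assumed): a 4x4 patch grid with 2x2 clusters.
clusters = rowwise_clusters(4, 4, 2)
pairs = next_cluster_pairs(clusters)
for context, target in pairs:
    print(f"context clusters: {len(context)} -> target patches: {target}")
```

Each `(context, target)` pair corresponds to one autoregressive step: the model sees only the clusters that come earlier in the row-wise order and is asked to predict the next one, which matches the one-directional way a recurrent model consumes a sequence.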

Abstract

The vision community has started to adopt the recently developed state space model, Mamba, as a new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, autoregressive modeling capitalizes well on Mamba's unidirectional recurrent structure, enabling faster overall training than other strategies such as mask modeling. Performance-wise, autoregressive pretraining gives the Mamba architecture markedly higher accuracy than its supervised-trained counterparts and, more importantly, unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with $384\times384$ inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.

Paper Structure (27 sections, 9 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Compared to Vim, our ARM considerably boosts the ImageNet accuracy and, more critically, offers a stronger pathway for scaling up.
  • Figure 2: Different prediction units in the autoregressive modeling.
  • Figure 3: Different prediction orderings of a visual sentence.
  • Figure 4: Comparison of block architectures between Vim and MambaMLP in pretraining and in finetuning.