MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining

Yunze Liu, Li Yi

TL;DR

The paper tackles pretraining for hybrid Mamba-Transformer vision backbones by introducing Masked Autoregressive Pretraining (MAP), a hierarchical objective that fuses local MAE-style learning with region-wise autoregressive decoding. Implemented on a default HybridNet architecture, MAP uses random masking and row-wise region reconstruction to jointly enhance Transformer local features and Mamba context modeling, achieving strong improvements over MAE, AR, and contrastive baselines in 2D and 3D tasks. Extensive experiments on ImageNet-1K, ADE20K, COCO, ModelNet40, and ScanObjectNN, supported by ablations, validate the effectiveness and generality of MAP for both hybrid and pure backbones. The work advances practical pretraining for hybrid architectures, enabling better performance and transfer in vision tasks across modalities and domains.

Abstract

Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to pretrain such hybrid networks effectively remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily focus on single-type network architectures. In contrast, pretraining strategies for hybrid architectures must be effective for both Mamba and Transformer components. Based on this, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and AR pretraining, improving the performance of Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperforms other pretraining strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP

Paper Structure (9 sections, 5 equations, 6 figures, 14 tables)


Figures (6)

  • Figure 1: We propose Masked Autoregressive Pretraining to pretrain the hybrid Mamba-Transformer backbone. It demonstrates significant performance improvements on both 2D and 3D tasks.
  • Figure 2: (a) MAE Pretraining. Its core lies in reconstructing the masked tokens based on the unmasked tokens to build a global bidirectional contextual understanding. (b) AR Pretraining. It focuses on building correlations between contexts, and its scalability has been thoroughly validated in the field of large language models. (c) MAP Pretraining (ours). Our method first randomly masks the input image, and then reconstructs the original image in a row-by-row autoregressive manner. This pretraining approach demonstrates significant advantages in modeling contextual features of local characteristics and the correlations between local features, making it highly compatible with the Mamba-Transformer hybrid architecture. (d) Performance gains under different pretraining strategies on ImageNet-1K. We find that MAE pretraining is better suited to Transformers, while AR pretraining is more compatible with Mamba. MAP, in turn, is better suited to the Mamba-Transformer backbone. MAP also demonstrates impressive performance when pretraining pure Mamba or pure Transformer backbones, showcasing the effectiveness and broad applicability of our method.
  • Figure 3: Different orders for AR pretraining and Mamba scanning. The row-first and column-first orders allow the network to perceive local information in different ways and sequences.
  • Figure 4: Different hybrid model designs. Design (d) achieves the best results; we set it as the default and refer to it as HybridNet.
  • Figure 5: Different Masking Strategies. The random masking strategy produces the best results.
  • ...and 1 more figure
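
The captions for Figures 2(c) and 3 describe three ingredients: random masking of patches, row-by-row autoregressive reconstruction, and row-first versus column-first scan orders. The sketch below illustrates only the index bookkeeping behind those ideas on a small patch grid. It is not the authors' implementation: the function names, the 4x4 grid, the 0.5 mask ratio, and the assumption that each row step conditions on all earlier rows are hypothetical choices made for illustration.

```python
import random


def random_mask(num_rows, num_cols, mask_ratio, seed=0):
    """Return a grid of booleans; True marks a masked patch (hypothetical helper)."""
    rng = random.Random(seed)
    total = num_rows * num_cols
    num_masked = int(total * mask_ratio)
    masked = set(rng.sample(range(total), num_masked))
    return [[(r * num_cols + c) in masked for c in range(num_cols)]
            for r in range(num_rows)]


def scan_order(num_rows, num_cols, row_first=True):
    """List row-major patch ids in row-first or column-first visiting order."""
    if row_first:
        return [r * num_cols + c for r in range(num_rows) for c in range(num_cols)]
    return [r * num_cols + c for c in range(num_cols) for r in range(num_rows)]


def row_wise_targets(mask):
    """Describe one autoregressive step per row.

    Assumption for illustration: step r may use rows 0..r-1 as context and
    must reconstruct the masked patches of row r.
    """
    steps = []
    for r, row in enumerate(mask):
        steps.append({
            "row": r,
            "context_rows": list(range(r)),
            "masked_cols": [c for c, is_masked in enumerate(row) if is_masked],
        })
    return steps


if __name__ == "__main__":
    mask = random_mask(4, 4, mask_ratio=0.5)  # arbitrary illustrative ratio
    for step in row_wise_targets(mask):
        print(step)
    print("row-first:   ", scan_order(2, 3, row_first=True))
    print("column-first:", scan_order(2, 3, row_first=False))
```

In a real model each step would be a decoder pass over token embeddings. Here the steps are plain dictionaries, so the masking pattern and the row-wise ordering can be inspected directly.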