MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
Yunze Liu, Li Yi
TL;DR
The paper addresses pretraining for hybrid Mamba-Transformer vision backbones by introducing Masked Autoregressive Pretraining (MAP), a hierarchical objective that combines local MAE-style reconstruction with region-wise autoregressive decoding. Implemented on a default HybridNet architecture, MAP uses random masking and row-wise region reconstruction to jointly improve the Transformer's local feature learning and Mamba's context modeling, and it outperforms MAE, AR, and contrastive baselines on 2D and 3D tasks. Experiments on ImageNet-1K, ADE20K, COCO, ModelNet40, and ScanObjectNN, together with ablations, support the effectiveness and generality of MAP for both hybrid and pure backbones. The result is a practical pretraining recipe for hybrid architectures that transfers across vision tasks and modalities.
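To make the two ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes patches are laid out in raster order on a grid, that each grid row is treated as one region, and that a patch may see every patch in its own row (local, MAE-style) plus all patches in earlier rows (region-wise autoregressive). The function names, the 4x4 grid, and the 0.5 mask ratio are hypothetical choices for illustration only.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w list of booleans; True marks a masked patch.

    Assumption: masking is uniform over patches, as in MAE-style random masking.
    """
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]


def row_causal_visibility(grid_h, grid_w):
    """Return an N x N table; entry [i][j] is True when patch i may see patch j.

    Assumption: each row is one region. Visibility is bidirectional inside a
    row and causal across rows (only the current and earlier rows are visible).
    """
    row_of = [i // grid_w for i in range(grid_h * grid_w)]
    n = len(row_of)
    return [[row_of[j] <= row_of[i] for j in range(n)] for i in range(n)]


# Hypothetical 4x4 patch grid with half of the patches masked.
mask = random_patch_mask(4, 4, mask_ratio=0.5)
visible = row_causal_visibility(4, 4)
```

In this sketch, `mask` selects which patches would be reconstructed, while `visible` describes which context a decoder could use for each patch. How the actual method wires these into the HybridNet encoder and decoder is not specified in this summary.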
Abstract
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to pretrain such hybrid networks effectively remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily target single-type network architectures. In contrast, a pretraining strategy for hybrid architectures must be effective for both the Mamba and Transformer components. To this end, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of MAE and autoregressive pretraining, improving the performance of both the Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone pretrained with MAP significantly outperforms the same backbone pretrained with other strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP
