CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation
Weiquan Huang, Yifei Shen, Yifan Yang
TL;DR
This technical report studies zero-shot generalization and OOD robustness for vision models by training Mamba state-space backbones with CLIP-style contrastive language-image pretraining. The authors train Mamba variants of several sizes and evaluate them on 26 zero-shot classification datasets and 16 OOD datasets, comparing against ViT baselines. The main finding is that Mamba models with 50–67 million parameters can match or exceed much larger ViT models (e.g., 84M ViT-B and 307M ViT-L) on zero-shot tasks, and that Mamba shows stronger OOD robustness, particularly under changes in image contrast and high-pass filtering, as well as a stronger shape bias. A Hessian analysis reveals a sharper, more non-convex loss landscape for Mamba than for ViT, which makes Mamba models harder to train. Code is available at https://github.com/raytrun/mamba-clip.
Abstract
State space models and Mamba-based models have been increasingly applied across various domains, achieving state-of-the-art performance. This technical report introduces the first attempt to train a transferable Mamba model utilizing contrastive language-image pretraining (CLIP). We have trained Mamba models of varying sizes and undertaken comprehensive evaluations of these models on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) model in zero-shot classification tasks, highlighting the parameter efficiency of Mamba models. In tests of OOD generalization, Mamba-based models exhibit exceptional performance in conditions of OOD image contrast or when subjected to high-pass filtering. However, a Hessian analysis indicates that Mamba models feature a sharper and more non-convex landscape compared to ViT-based models, making them more challenging to train. The source code is available at https://github.com/raytrun/mamba-clip.
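For readers unfamiliar with the contrastive language-image pretraining objective mentioned above, the following is a minimal sketch of the symmetric contrastive loss commonly used in CLIP-style training. It is not taken from the report's code: the function names are hypothetical, the embeddings are plain Python lists standing in for the outputs of an image encoder (such as a Mamba backbone) and a text encoder, and the fixed temperature of 0.07 is an assumption (in CLIP the temperature is typically learned).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-text cosine similarities.

    image_embs[k] and text_embs[k] are assumed to form a matching pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: transpose the logits and repeat.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)
```

With perfectly aligned pairs the loss approaches zero, while a batch in which all embeddings are identical gives a loss of log(n), since the model cannot distinguish the matching pair from the negatives.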
