ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu

TL;DR

ML-Mamba introduces a multimodal large language model that replaces Transformer backbones with the efficient Mamba-2 state-space model to achieve linear-time sequence processing. It couples a vision encoder (DINOv2+SigLIP) with a novel Mamba-2 Scan Connector (MSC), including 2D visual selective scanning (MVSS) and SwiGLU, to fuse visual inputs into a pre-trained Mamba-2 LLM. Through ablations and six benchmarks, ML-Mamba achieves competitive performance with similarly sized models and outperforms larger VLMs on several tasks, while delivering substantially faster inference than Transformer-based baselines. The work demonstrates the viability of state-space models for multimodal tasks and outlines extensible design choices (MMC variants, scan mechanisms) that influence performance and efficiency. Limitations include dataset biases and memory constraints for mobile deployment, suggesting directions for future optimization and broader applicability.

Abstract

Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent and efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also trying various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. The experimental results show that: (1) We empirically explore how to effectively apply the 2D vision selective scan mechanism to multimodal learning and propose a novel multimodal connector, the Mamba-2 Scan Connector (MSC), which enhances representational capabilities; (2) ML-Mamba achieves performance comparable to state-of-the-art methods such as TinyLLaVA and MobileVLM v2 through its linear sequential modeling, while offering faster inference; (3) Compared to multimodal models utilizing Mamba-1, the Mamba-2-based ML-Mamba exhibits superior inference performance and effectiveness.
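
The linear-scaling claim can be illustrated with a toy state space recurrence. The sketch below is a generic selective-SSM scan written for illustration, not ML-Mamba's or Mamba-2's actual kernel; the function name `ssm_scan` and the shapes chosen for `a`, `b`, and `c` are assumptions. Each token updates a fixed-size state once, so the cost grows linearly with sequence length T, whereas self-attention compares every pair of tokens.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Toy sequential scan of a state space model (illustrative only).

    x: (T,)   scalar input per time step
    a: (T,)   per-step scalar decay applied to the state
    b: (T, N) per-step input projection
    c: (T, N) per-step output projection

    Recurrence: h_t = a_t * h_{t-1} + b_t * x_t,   y_t = <c_t, h_t>
    """
    T, N = b.shape
    h = np.zeros(N)      # fixed-size hidden state, independent of T
    y = np.empty(T)
    for t in range(T):   # a single pass over the sequence: O(T * N) work
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] @ h
    return y

# Usage: three steps with constant decay 0.5 and a one-dimensional state.
y = ssm_scan(np.ones(3), np.full(3, 0.5), np.ones((3, 1)), np.ones((3, 1)))
```

Production implementations replace this Python loop with hardware-aware parallel algorithms, but the per-token cost stays independent of how many tokens came before.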

Paper Structure (25 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The architecture of ML-Mamba (right) uses Mamba-2 as the backbone (left). It includes a visual encoder, a multi-modal connector called the Mamba-2 Scan Connector (MSC), an MLP projector, and a language model. We use the pre-trained Mamba-2 large language model (Mamba-2 LLM) as the language model and a pre-trained visual transformer model as the visual encoder.
  • Figure 2: Three architectures of MultiModal Connector: (a) MLP; (b) MSC-MLP (Basic); (c) MSC-MLP (Advanced).
  • Figure 3: Illustration of two different Vision Selective Scan (VSS) Mechanisms: Bidirectional-Scan Mechanism (BSM) (top) and Cross-Scan Mechanism (CSM) (bottom).
  • Figure 4: Comparison of block architectures: the Mamba-2 block, the Mamba-2 Scan Connector (BSM, with SwiGLU), and the Mamba-2 Scan Connector (CSM, with SwiGLU).
  • Figure 5: SwiGLU structure in MSC-MLP (Advanced).
  • ...and 1 more figures
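
The scan mechanisms and the SwiGLU block named in the captions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not spell out the traversal orders, so the sketch assumes BSM reads the flattened patch grid forward and backward, and CSM reads it along four routes (row-major, column-major, and their reverses). The function names and weight shapes are hypothetical.

```python
import numpy as np

def bidirectional_scan(grid):
    """Assumed BSM ordering: row-major forward pass plus its reverse.

    grid: (H, W, D) array of patch features. Returns two (H*W, D) sequences.
    """
    h, w, d = grid.shape
    forward = grid.reshape(h * w, d)
    return [forward, forward[::-1]]

def cross_scan(grid):
    """Assumed CSM ordering: row-major, column-major, and both reverses.

    grid: (H, W, D) array of patch features. Returns four (H*W, D) sequences.
    """
    h, w, d = grid.shape
    row_major = grid.reshape(h * w, d)
    col_major = grid.transpose(1, 0, 2).reshape(h * w, d)
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (Swish(x W_gate) * (x W_up)) W_down."""
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))  # Swish / SiLU: z * sigmoid(z)
    return (swish * (x @ w_up)) @ w_down

# Usage: a 2x3 grid whose single feature is the patch index 0..5,
# so each returned sequence shows the order in which patches are visited.
grid = np.arange(6).reshape(2, 3, 1)
for seq in cross_scan(grid):
    print(seq[:, 0].tolist())
```

In a connector of this kind, each scan route would typically be processed by a sequence model and the outputs merged back onto the grid before the gated projection; the merge rule is left out here because the captions do not specify it.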