MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

TL;DR

MoE-Mamba combines Mixture of Experts (MoE) with the Mamba State Space Model (SSM) to make selective computation more efficient for long-context language modeling. By interleaving Switch MoE layers with Mamba blocks, the model reaches the same performance as Mamba in 2.35× fewer training steps while preserving Mamba's advantages over Transformer-based models. The study reports ablations on expert count, architecture variants, and active-parameter ratios, covering model sizes of up to ~2.4B total parameters. The findings point toward scalable, efficient SSM-based models that can compete with large Transformer-based systems, and they motivate further exploration of MoE-enabled SSM architectures.
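
The interleaving and top-1 (Switch-style) routing mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `switch_route`, `switch_moe_layer`, and `layer_pattern` are hypothetical, the experts are arbitrary callables, and residual connections, normalization, load balancing, and the Mamba block itself are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights):
    """Top-1 routing: return (expert_index, gate_probability) for one token.

    router_weights holds one weight row per expert (an assumption of this sketch).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, probs[idx]

def switch_moe_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale its output by the gate."""
    out = []
    for tok in tokens:
        idx, gate = switch_route(tok, router_weights)
        out.append([gate * v for v in experts[idx](tok)])
    return out

def layer_pattern(n_pairs):
    """Alternating layer order: a Mamba layer followed by an MoE layer."""
    return ["mamba", "moe"] * n_pairs
```

Because only one expert runs per token, adding experts increases the total parameter count while the per-token compute of the MoE layer stays roughly constant.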

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.

Paper Structure

This paper contains 31 sections, 3 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Log perplexity throughout the training. From top to bottom: $\text{Mamba}_{100\text{M}}$; $\text{Transformer-MoE}_{100\text{M}}$; $\text{MoE-Mamba}_{100\text{M}}$.
  • Figure 2: Diagrams of the architectures. From the left: vanilla Transformer, Transformer-MoE, Mamba, MoE-Mamba.
  • Figure 3: Diagram of Parallel MoE-Mamba architecture (left) and Mamba Block (right). The outputs of the Gate and Conv Projections are $E$ (expansion factor) times larger than the input, i.e., Conv and SSM operate on vectors $\in \mathbb{R}^{E \cdot d_\text{model}}$. Vanilla Mamba assumes $E=2$ (Gu and Dao, 2023). The expansion factor $E$ determines how much the input vector is scaled up by the Gate and Conv Projections and then scaled down by the Output Projection; as a result, the number of FLOPs and parameters in the Mamba layer is proportional to $E$.
  • Figure 4: Smoothed training loss (log perplexity) for a differing number of experts for MoE-Mamba with ca. 26M active non-embedding parameters. The final log perplexity improves monotonically as the number of experts increases.
  • Figure 5: Final log perplexity at different ratios of Mamba-to-MoE active parameters. Note that MoE contains the majority of the total parameters in each model. For further discussion of the ratios explored, see the appendix on the optimal ratio.
  • ...and 4 more figures
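
The Figure 3 caption states that the parameter count of the Mamba layer is proportional to the expansion factor $E$. A rough sketch of why, counting only the three projections named in the caption: the function name `mamba_projection_params` and the example `d_model` value are assumptions for illustration, and the convolution kernel, SSM parameters, and biases are ignored.

```python
def mamba_projection_params(d_model, expansion=2):
    """Approximate weight count of the Gate, Conv, and Output Projections.

    Gate and Conv Projections map d_model -> expansion * d_model;
    the Output Projection maps expansion * d_model -> d_model.
    """
    d_inner = expansion * d_model
    gate_proj = d_model * d_inner
    conv_proj = d_model * d_inner
    out_proj = d_inner * d_model
    return gate_proj + conv_proj + out_proj  # equals 3 * expansion * d_model**2

# Doubling the expansion factor doubles the projection weights.
base = mamba_projection_params(512, expansion=2)
wide = mamba_projection_params(512, expansion=4)
print(base, wide, wide / base)
```

Under these simplifications the count is $3 E \, d_\text{model}^2$, which is linear in $E$.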