MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur
TL;DR
MoE-Mamba integrates Mixture of Experts (MoE) with the Mamba State Space Model (SSM) to achieve efficient selective computation for long-context language modeling. By interleaving Switch MoE layers with Mamba blocks, the model reaches the same performance as Mamba in 2.35× fewer training steps while preserving Mamba's inference advantages over Transformer-based models. The study provides extensive ablations on expert count, architecture variants, and active-parameter ratios, and reports that the gains are robust across model sizes (up to ~2.4B total parameters) and training scales. The findings open a pathway to scalable, efficient SSM-based models that can compete with large Transformer-based systems, and they motivate further exploration of MoE-enabled SSM architectures.
Abstract
State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and a baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba over the Transformer.
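
The interleaving of Mamba blocks with Switch MoE layers described in the TL;DR can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Every name here (`switch_route`, `SwitchMoE`, `placeholder_mamba`, `moe_mamba_stack`) is hypothetical, the Mamba block is replaced by a simple causal running average rather than a real selective SSM, each expert is a single linear map rather than a feed-forward network, and the residual connections are an assumption. Normalization and load balancing are omitted. The only parts meant to mirror the text are top-1 (Switch-style) routing and the alternating layer order.

```python
import math
import random


def switch_route(router_logits):
    """Top-1 (Switch-style) routing: return (expert index, gate probability)."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class SwitchMoE:
    """Toy MoE layer: each token is processed by exactly one expert."""

    def __init__(self, d_model, n_experts, rng):
        # Router: one weight vector per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(d_model)]
                       for _ in range(n_experts)]
        # Assumption: each expert is a single d_model x d_model linear map.
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(d_model)]
                         for _ in range(d_model)]
                        for _ in range(n_experts)]
        self.n_calls = [0] * n_experts  # tokens routed to each expert

    def __call__(self, tokens):
        out = []
        for x in tokens:
            idx, gate = switch_route([dot(w, x) for w in self.router])
            self.n_calls[idx] += 1
            # Only the selected expert runs; its output is scaled by the gate.
            out.append([gate * dot(row, x) for row in self.experts[idx]])
        return out


def placeholder_mamba(tokens, decay=0.5):
    """Stand-in for a Mamba block: a causal running average, NOT a real SSM."""
    state = [0.0] * len(tokens[0])
    out = []
    for x in tokens:
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
        out.append(list(state))
    return out


def residual(layer, tokens):
    """Apply a layer and add its output to the input (assumed residual path)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(tokens, layer(tokens))]


def moe_mamba_stack(tokens, moe_layers):
    """Alternate a (placeholder) Mamba block with a Switch MoE layer."""
    for moe in moe_layers:
        tokens = residual(placeholder_mamba, tokens)
        tokens = residual(moe, tokens)
    return tokens
```

Calling `moe_mamba_stack(tokens, [SwitchMoE(4, 3, rng) for _ in range(2)])` on a list of 4-dimensional token vectors runs two Mamba/MoE pairs. Because routing is top-1, the number of expert evaluations per token stays fixed as experts are added, so total parameters grow with the expert count while active parameters per token do not.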
