BlackMamba: Mixture of Experts for State-Space Models

Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge

TL;DR

This work addresses the efficiency bottlenecks of decoder-only transformers by integrating Mamba State-Space Model blocks with Mixture-of-Experts routing in a unified architecture called BlackMamba. By replacing attention with linear-time SSM computations and dense MLPs with routed experts, BlackMamba achieves competitive language modeling performance at reduced training and inference FLOPs, with generation exhibiting linear time and memory characteristics. The authors train and open-source 340M/1.5B and 630M/2.8B parameter variants on a 300B-token mixture dataset, introduce a faster Sinkhorn routing initialization, and demonstrate favorable latency and FLOP profiles across long sequences, along with balanced expert utilization for most layers. The release of all weights, checkpoints, and inference code under an Apache 2.0 license aims to catalyze broader exploration of combining SSMs and MoEs, offering a practical path toward efficient, long-context language modeling at scale.
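
The balanced expert utilization mentioned above depends on the router spreading tokens evenly across experts. As a rough illustration, the sketch below runs plain Sinkhorn-Knopp normalization over a matrix of router logits. It is not the paper's faster initialization, whose details are not given in this summary, and the function name, matrix sizes, and iteration count are all hypothetical choices.

```python
import numpy as np

def sinkhorn_balance(logits, n_iters=100):
    """Rescale exp(logits) so every token row sums to 1 and every expert
    column carries roughly the same total load (n_tokens / n_experts)."""
    n_tokens, n_experts = logits.shape
    target_load = n_tokens / n_experts
    weights = np.exp(logits - logits.max())  # strictly positive matrix
    for _ in range(n_iters):
        # Balance experts: scale each column to the target load.
        weights *= target_load / weights.sum(axis=0, keepdims=True)
        # Normalize tokens: scale each row to sum to 1.
        weights /= weights.sum(axis=1, keepdims=True)
    return weights

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(16, 4))  # hypothetical: 16 tokens, 4 experts
balanced = sinkhorn_balance(router_logits)
assignments = balanced.argmax(axis=1)     # top-1 expert per token
```

Taking the argmax of the balanced matrix rather than of the raw logits tends to even out expert load, although it does not guarantee exactly equal token counts per expert.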

Abstract

State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM, shows impressive performance in both language modeling and long-sequence processing tasks. Simultaneously, mixture-of-experts (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms them in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits the benefits of both SSM and MoE architectures, combining linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code as open source. Inference code at: https://github.com/Zyphra/BlackMamba
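
To make the combination concrete, here is a minimal sketch of one recurrent layer followed by one routed-expert layer, run token by token. Everything in it is an assumption for illustration: the sizes are arbitrary, the fixed diagonal recurrence is only a stand-in for Mamba's input-dependent selective SSM, and the ReLU experts with top-1 argmax routing are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, HIDDEN = 8, 4, 16  # hypothetical sizes, not the paper's

# Toy diagonal linear recurrence standing in for a Mamba block
# (assumption: real Mamba uses input-dependent, selective parameters).
decay = rng.uniform(0.5, 0.99, size=D)

def ssm_step(state, x):
    """One generation step; cost does not depend on how many tokens came before."""
    state = decay * state + (1.0 - decay) * x
    return state, x + state  # residual connection

# Routed expert MLPs standing in for the MoE block.
W_router = rng.normal(size=(D, N_EXPERTS))
W_in = rng.normal(size=(N_EXPERTS, D, HIDDEN)) / np.sqrt(D)
W_out = rng.normal(size=(N_EXPERTS, HIDDEN, D)) / np.sqrt(HIDDEN)

def moe_step(x):
    """Send one token to a single expert; only that expert's weights are used."""
    expert = int(np.argmax(x @ W_router))
    h = np.maximum(x @ W_in[expert], 0.0)  # ReLU MLP
    return expert, x + h @ W_out[expert]   # residual connection

def run_layer_pair(tokens):
    """Process tokens one at a time, carrying only a fixed-size state."""
    state = np.zeros(D)
    outputs, experts = [], []
    for x in tokens:
        state, y = ssm_step(state, x)
        expert, y = moe_step(y)
        outputs.append(y)
        experts.append(expert)
    return np.stack(outputs), experts, state

tokens = rng.normal(size=(32, D))
outputs, experts, final_state = run_layer_pair(tokens)
```

The carried state has the same size after 32 tokens as after one, which is the sense in which generation is linear in sequence length. Each token also touches only one expert's weights, so per-token compute stays well below what the total parameter count would suggest.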

Paper Structure (22 sections, 23 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Architecture of dense transformer, dense Mamba, transformer-MoE, and Mamba-MoE
  • Figure 2: Ratio of data categories in the pretraining dataset of BlackMamba
  • Figure 3: Comparison of BlackMamba average evaluation performance across activated forward parameters
  • Figure 4: Comparison of BlackMamba average evaluation performance across training FLOPs
  • Figure 5: Generation latency of BlackMamba compared to dense transformers, dense Mamba, and transformer-MoE
  • ...and 9 more figures