Falcon Mamba: The First Competitive Attention-free 7B Language Model

Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid

TL;DR

Falcon Mamba 7B introduces a pure Mamba State Space Model that competes with Transformer baselines at 7B scale, addressing the question of whether attention-free architectures can reach SoTA performance. Trained on a large, carefully curated data mix and using a four-stage learning-rate curriculum with long-context focus, it achieves strong results across HF Leaderboard benchmarks and long-context tasks, while delivering constant memory usage for long generations. The work demonstrates notable advantages in inference speed and memory efficiency for long sequences, and provides detailed integration support via HuggingFace, with a public pre-decay checkpoint that researchers can build upon. While showing strong general performance, the authors acknowledge potential limitations in in-context learning relative to Transformers and highlight avenues for future work, including ultra-long contexts and hybrid models that combine the strengths of SSMs and attention mechanisms.

Abstract

In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B. Currently, Falcon Mamba 7B is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models, according to the Open LLM Leaderboard. Due to its architecture, Falcon Mamba 7B is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we demonstrate that even the pure Mamba design can achieve similar, or even superior results compared to the Transformer and hybrid designs. We make the weights of our implementation of Falcon Mamba 7B publicly available on https://huggingface.co/tiiuae/falcon-mamba-7b, under a permissive license.

Paper Structure (14 sections, 1 equation, 3 figures, 3 tables)

This paper contains 14 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Data mixtures across training stages
  • Figure 2: We vary the context length of the prompt to determine the maximum sequence length that could be processed without encountering an out-of-memory (OOM) error. To ensure a fair comparison, all models were configured with a rescaled vocabulary size.
  • Figure 3: With a fixed batch size and a context length of 1, we vary the number of generated tokens up to 130k for Falcon-Mamba-7B and for Mistral-7B with a resized vocabulary, for a fair comparison.
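
The memory behaviour that Figures 2 and 3 measure can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper. It compares the size of an attention key/value cache, which grows with every token, against a fixed-size recurrent state of the kind a Mamba-style layer carries. The function names and all dimensions (layer counts, head sizes, state sizes, 2-byte elements) are illustrative assumptions rather than the reported configuration of either model, and the estimate ignores weights, activations, and techniques such as sliding-window attention.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence in a full-attention Transformer.

    Each layer stores one key and one value tensor of shape
    (n_tokens, n_kv_heads, head_dim), so the total grows linearly with n_tokens.
    """
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_layers, d_inner, d_state, d_conv, bytes_per_elem=2):
    """Approximate recurrent-state size for one sequence in a Mamba-style model.

    Each layer keeps an SSM hidden state (d_inner x d_state) and a short
    convolution buffer (d_inner x d_conv). Neither depends on sequence length.
    """
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to give 7B-scale orders of magnitude.
    attn_cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    ssm_cfg = dict(n_layers=64, d_inner=8192, d_state=16, d_conv=4)

    state_mib = ssm_state_bytes(**ssm_cfg) / 2**20
    for n_tokens in (1_000, 10_000, 130_000):
        cache_mib = kv_cache_bytes(n_tokens, **attn_cfg) / 2**20
        print(f"{n_tokens:>7} tokens: KV cache ~{cache_mib:.0f} MiB, "
              f"recurrent state ~{state_mib:.0f} MiB")
```

Under these assumptions the cache estimate scales linearly with the number of tokens while the recurrent-state estimate stays the same, which is the qualitative pattern the two figures are designed to show. The absolute numbers depend entirely on the assumed dimensions.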