The Zamba2 Suite: Technical Report

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge

TL;DR

The paper introduces Zamba2, a family of small open-weight LLMs with a hybrid Mamba2-transformer architecture that achieves state-of-the-art performance and substantially improved inference efficiency. It details two-phase pretraining on the Zyda-2 dataset, followed by instruction tuning, context-extension techniques, and 4-bit quantization, all released openly. Key contributions include architectural innovations (dual shared blocks, LoRAs, rotary embeddings), high-quality open pretraining data, and practical post-training methods that enable on-device deployment. By releasing both the models and the Zyda-2 dataset, the work advocates for democratizing access to capable, efficient LLMs at sub-10B scales.

Abstract

In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state-of-the-art performance against the leading open-weights models of their class, while delivering substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture, training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series as well as instruction-tuned variants that are strongly competitive against comparable instruct-tuned models of their class. We additionally open-source the pretraining dataset, which we call Zyda-2, used to train the Zamba2 series of models. The models and datasets used in this work are openly available at https://huggingface.co/Zyphra.

Paper Structure

This paper contains 13 sections, 1 equation, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Performance (MMLU 5-shot or 0-shot) vs time-to-first-token for the Zamba2 series models vs leading competing models. Due to its novel Zamba2 architecture, our series of models significantly outperforms others in both quality and latency.
  • Figure 2: Architecture diagrams for the 1.2B, 2.7B and 7.4B models. The 1.2B architecture differs in two ways: it also includes LoRAs on the shared attention blocks, and it uses only a single shared block. A single shared block was used for the 1.2B because the benefit of two alternating blocks was significantly smaller for the smaller model, since it has fewer total attention blocks. The 2.7B and 7.4B models lack the shared attention LoRAs because we only discovered that they were beneficial after training had commenced.
  • Figure 3: Pipeline for producing the Zyda-2 dataset. Zyda-2 comprises four component datasets: Zyda-1, DCLM, FineWeb, and Dolma. We cross-deduplicated all datasets against each other. For Zyda-1 and Dolma we also performed model-based quality filtering using NVIDIA's NeMo.
  • Figure 4: The performance of Zyda-2 vs other leading language modelling datasets. Reported is the average score on a set of standard language modelling evaluation tasks after annealing Zamba2-2.7B on each dataset. We followed the annealing ablation protocol of Blakeney et al. (2024) to measure performance instead of training models from scratch because we observed significantly higher signal with this approach.
  • Figure 5: Performance (in 5-shot MMLU) vs the number of tokens used for training. We observe a fairly clear sigmoidal relationship between performance and training tokens for current leading transformer models, with Zamba2 and Gemma2 as clear outliers. We believe Gemma2 is an outlier because of its use of distillation from a larger model, while Zamba2 outperforms due to its architecture.
  • ...and 5 more figures
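
The Figure 3 caption describes a two-step data pipeline: cross-deduplication across all component datasets, then model-based quality filtering on Zyda-1 and Dolma only. The sketch below illustrates that flow under loud assumptions: it uses exact matching on normalized text via SHA-256 as a stand-in for whatever deduplication method the authors used, and `toy_score` is a hypothetical word-count heuristic standing in for a model-based quality classifier. The function names, the threshold, the step order, and the sample documents are all illustrative and not taken from the report.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def cross_deduplicate(datasets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the first occurrence of each document across ALL datasets.

    Assumption: exact match on normalized text. Datasets are visited in
    insertion order, so earlier datasets win ties.
    """
    seen: Set[str] = set()
    result: Dict[str, List[str]] = {}
    for name, docs in datasets.items():
        kept: List[str] = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        result[name] = kept
    return result


def quality_filter(
    docs: List[str], score: Callable[[str], float], threshold: float
) -> List[str]:
    """Keep documents whose quality score meets the threshold."""
    return [doc for doc in docs if score(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a model-based classifier: the word count."""
    return float(len(doc.split()))


def build_corpus(
    datasets: Dict[str, List[str]],
    filtered_names: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Cross-deduplicate everything, then filter only the named datasets."""
    deduped = cross_deduplicate(datasets)
    for name in filtered_names:
        if name in deduped:
            deduped[name] = quality_filter(deduped[name], score, threshold)
    return deduped


# Tiny made-up inputs; the keys only mirror the four component datasets.
raw = {
    "zyda1": ["The cat sat on the mat.", "ok"],
    "dclm": ["the  cat sat on the MAT.", "A distinct DCLM document here."],
    "fineweb": ["A distinct DCLM document here.", "Fresh FineWeb text sample."],
    "dolma": ["short", "Dolma contributes this longer document."],
}
corpus = build_corpus(
    raw, filtered_names=("zyda1", "dolma"), score=toy_score, threshold=3.0
)
```

In this toy run, the duplicated sentences in `dclm` and `fineweb` are dropped because an earlier dataset already contributed them, and the one-word documents in `zyda1` and `dolma` are removed by the filter. A production pipeline would more plausibly use approximate (near-duplicate) matching and a trained classifier, which this sketch does not attempt.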