The Zamba2 Suite: Technical Report
Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge
TL;DR
The paper introduces Zamba2, a family of small open-weight LLMs with a hybrid Mamba2-transformer architecture that achieves state-of-the-art performance among open-weight models of comparable size, along with substantially improved inference efficiency. It details a two-phase pretraining process on the Zyda-2 dataset, followed by instruction tuning, context-extension techniques, and 4-bit quantization, all released openly. Key contributions include architectural innovations (dual shared blocks, LoRAs, rotary position embeddings), a high-quality open pretraining dataset, and practical post-training methods that enable on-device deployment. By releasing both the models and the Zyda-2 dataset, the work aims to democratize access to capable, efficient LLMs at sub-10B scales.
Abstract
In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state-of-the-art performance against the leading open-weights models of their class, while delivering substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture and its training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series, as well as instruction-tuned variants that are strongly competitive against comparable instruction-tuned models of their class. We additionally open-source Zyda-2, the pretraining dataset used to train the Zamba2 series. The models and datasets used in this work are openly available at https://huggingface.co/Zyphra
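To give a rough sense of why the parameter counts above, combined with the 4-bit quantization mentioned in the TL;DR, matter for on-device deployment, the following back-of-the-envelope sketch estimates weight storage alone. It is an illustration under stated assumptions, not a measurement from the report: it assumes every parameter is stored at the given bit width, takes 16-bit as the unquantized baseline, and ignores quantization metadata (scales, zero points), activations, and inference-time state. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only storage in GB (1 GB = 1e9 bytes).

    Assumes every parameter uses exactly `bits_per_weight` bits and
    ignores quantization metadata, activations, and runtime state.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts as stated in the abstract.
MODEL_SIZES = {"1.2B": 1.2e9, "2.7B": 2.7e9, "7.4B": 7.4e9}

for name, num_params in MODEL_SIZES.items():
    full = weight_memory_gb(num_params, 16)   # assumed 16-bit baseline
    quant = weight_memory_gb(num_params, 4)   # 4-bit quantized weights
    print(f"{name}: 16-bit ~ {full:.2f} GB, 4-bit ~ {quant:.2f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by a factor of four; real quantized checkpoints are typically somewhat larger because some tensors are kept at higher precision.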
