Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu
TL;DR
Llamba distills Llama-3.x Transformers into subquadratic recurrent models built on Discrete Mamba-2, using the MOHAWK cross-architecture distillation method with less than 0.1% of the training data typically used for models of similar size. The models retain elements of the Llama architecture, and an optimized on-device implementation with 4-bit quantization makes deployment on edge hardware practical. Llamba-1B, Llamba-3B, and Llamba-8B reach benchmark performance comparable to Transformer baselines while delivering higher inference throughput and supporting larger batch sizes. The work points to a scalable, memory-efficient route to language models that maintain quality and can run privately on-device.
Abstract
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
