Zebra-Llama: Towards Extremely Efficient Hybrid Models

Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum

TL;DR

Zebra-Llama introduces a practical post-training framework to compose extremely efficient hybrid LLMs by integrating MLA and Mamba2 layers into existing pre-trained Transformers. The pipeline uses refined initialization (including SVD-based MLA init and Mamba2 mapping), Intermediate Layer Distillation (ILD), and SMART layer placement to preserve teacher knowledge while dramatically reducing KV cache and memory. End-to-end distillation followed by Direct Preference Optimization (DPO) yields models that match or exceed Transformer-level accuracy with 25x–36x KV cache compression and substantially higher inference throughput. The approach is validated across Llama3 and Qwen families, with strong zero-shot, few-shot, and long-context performance, and shows practical potential for democratizing access to efficient LLMs. The work highlights scalable, data-efficient post-training methods to deploy capable hybrids in resource-constrained environments.

Abstract

With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size, down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively, while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLlama, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.
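
The abstract states KV cache size as a percentage of the original, while other comparisons use "N-times smaller" factors. As a small arithmetic aid, the sketch below converts between the two. The percentages are the ones quoted in the abstract; the function and variable names are illustrative assumptions, not part of any released code.

```python
# KV cache size as a percentage of the original Transformer cache,
# as quoted in the abstract for each model size.
reported_kv_percent = {"1B": 3.9, "3B": 2.0, "8B": 2.73}


def compression_factor(percent_of_original: float) -> float:
    """Convert 'cache is p% of the original' into an 'N-times smaller' factor."""
    if percent_of_original <= 0:
        raise ValueError("percentage must be positive")
    return 100.0 / percent_of_original


# Hypothetical helper dictionary: model size -> compression factor (one decimal).
factors = {
    size: round(compression_factor(percent), 1)
    for size, percent in reported_kv_percent.items()
}

for size, factor in factors.items():
    print(f"Zebra-Llama-{size}: about {factor}x smaller KV cache")
```

Because the percentages are rounded as reported, the resulting factors are approximate.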

Paper Structure

This paper contains 53 sections, 19 equations, 9 figures, 16 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparing 8B-scale models on average LM Harness score vs. KV cache size. Zebra-Llama (green) matches or exceeds baselines with smaller KV cache and fewer training tokens. Circle and square sizes indicate training tokens (billions for post-training, trillions for pre-training).
  • Figure 2: Overview of our hybrid model composition pipeline. The process consists of three stages: (1) Weight Initialization -- we initialize pure Mamba2 and MLA models from a pre-trained Transformer via structured mapping; (2) Refined Initialization through Intermediate Layer Distillation (ILD) -- we refine both models by aligning their internal representations with the base model on a small dataset; and (3) SMART Layer Selection -- we compose the final hybrid model by selecting MLA and Mamba2 layers based on sensitivity analysis.
  • Figure 3: Layer sensitivity scores for Llama3.2-1B using 4096 samples from the validation dataset. Red markers indicate the MLA layer indices selected by our SMART strategy with $N=4$.
  • Figure 4: Inference throughput vs. context length of various 8B-size models. We measure the throughput under batch size 48 and output length 1024.
  • Figure 5: Inference peak memory vs. context length of various 8B-size models. The out-of-memory scenarios are marked with 'OOM'.
  • ...and 4 more figures
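
The captions for Figures 2 and 3 describe choosing which layers become MLA layers from per-layer sensitivity scores, with $N=4$ layers selected. The exact SMART selection rule is not given in this section, so the sketch below substitutes a plain top-$N$ ranking as a simplified stand-in. The scores are made-up values for illustration and the function name is hypothetical.

```python
from typing import List, Sequence


def select_top_n_layers(sensitivity: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n most sensitive layers, in ascending order.

    This is a plain top-n ranking used as a stand-in; it is NOT claimed to be
    the paper's SMART strategy, whose details are not stated in this section.
    """
    if not 0 <= n <= len(sensitivity):
        raise ValueError("n must be between 0 and the number of layers")
    # Rank layer indices from most to least sensitive.
    ranked = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i], reverse=True)
    # Keep the top n and restore layer order for readability.
    return sorted(ranked[:n])


# Made-up sensitivity scores for a 16-layer model (illustrative only).
hypothetical_scores = [
    0.90, 0.20, 0.10, 0.30, 0.15, 0.70, 0.05, 0.10,
    0.20, 0.60, 0.10, 0.12, 0.08, 0.30, 0.50, 0.25,
]

mla_layers = select_top_n_layers(hypothetical_scores, n=4)
print("Layers assigned to MLA:", mla_layers)
print("Remaining layers would use Mamba2 in this simplified picture.")
```

In this simplified picture, the layers whose replacement hurts the model most keep an attention-style (MLA) block, and the rest are replaced by Mamba2 blocks.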