Effective Distillation to Hybrid xLSTM Architectures

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

Abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

Paper Structure (37 sections, 16 equations, 17 figures, 8 tables)

Figures (17)

  • Figure 1: Win-and-Tie rate ($C_\alpha$) curves of our distilled xLSTM-Qwen2.5-7B-IT (left) and xLSTM-Llama3.1-8B-IT (right) in comparison against the best sub-quadratic baseline across generation benchmarks spanning math, code, STEM, and chat domains. Higher is better.
  • Figure 2: Illustration of our hybrid method consisting of mLSTM, sliding-window attention, and sink tokens. Our approach comprises 4 primary steps: (1) transfer the original teacher weights to the student and introduce adapters and gates, (2) hidden-state matching, (3) subsequent merging of query and key projections, and (4) knowledge distillation.
  • Figure 3: Downstream evaluations for (a) language understanding and (b) language generation tasks. We report the recovery rate relative to teacher scores for our mLSTM-based student and established baselines with comparable parameter counts. The dotted line at $1.0$ indicates parity with the Transformer teacher. Our model matches the teacher's performance across language understanding tasks, while exceeding the teacher on four of the considered generation tasks.
  • Figure 4: Teacher-recovery rates for instruction-tuned xLSTM students and the effect of expert merging. Top: xLSTM-Llama3.1-8B-IT distilled from Llama3.1-8B-IT vs. Mamba-in-Llama; Bottom: xLSTM-Qwen2.5-7B-IT distilled from Qwen2.5-7B-IT vs. QRWKV7-7B-IT. For each benchmark (x-axis; grouped by domain color), we report relative performance as the student/teacher score ratio (y-axis); the dotted line at $1.0$ indicates parity with the Transformer teacher. For our method (left bar in each pair), the merged student is shown. For a given task, the striped area on top of the bar indicates gains (colored) or losses (empty) compared to our linearized domain expert before merging. For the baselines (right bar in each pair), the light bar shows the recovery rate.
  • Figure 5: Inference comparison for the generation stage between the Transformer-based teacher and our xLSTM-based student. In (a), we show generation latency at different generation budgets ($B=1$). In (b), we report the memory consumption as a percentage of GPU memory during generation ($B=1$). In (c), we show the generation throughput when generating 100 tokens with varying prefill lengths and $B=8$.
  • ...and 12 more figures
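
The two metrics named in the captions above can be illustrated with a minimal Python sketch. The recovery rate follows the student/teacher score ratio stated in the Figure 4 caption. The exact definition of the tolerance-corrected Win-and-Tie rate $C_\alpha$ is not given in this excerpt, so the version below is an assumption: a task counts as a win or tie when the student reaches at least $(1-\alpha)$ times the teacher's score. The function names and the example scores are hypothetical and not taken from the paper.

```python
from typing import Dict


def recovery_rates(student: Dict[str, float], teacher: Dict[str, float]) -> Dict[str, float]:
    """Per-task recovery rate: student score divided by teacher score.

    A value of 1.0 means parity with the teacher on that task.
    """
    return {task: student[task] / teacher[task] for task in teacher}


def win_and_tie_rate(student: Dict[str, float], teacher: Dict[str, float], alpha: float) -> float:
    """Fraction of tasks on which the student wins or ties within tolerance alpha.

    ASSUMPTION: a relative tolerance is used, i.e. a task counts when
    student >= (1 - alpha) * teacher. The paper's own definition may differ.
    """
    if not teacher:
        raise ValueError("at least one task is required")
    hits = sum(1 for task in teacher if student[task] >= (1.0 - alpha) * teacher[task])
    return hits / len(teacher)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    teacher_scores = {"math": 50.0, "code": 40.0, "chat": 80.0}
    student_scores = {"math": 52.0, "code": 36.0, "chat": 79.0}

    print(recovery_rates(student_scores, teacher_scores))
    # Sweeping alpha traces a curve in the spirit of Figure 1.
    for alpha in (0.0, 0.05, 0.2):
        print(alpha, win_and_tie_rate(student_scores, teacher_scores, alpha))
```

Under this assumed definition, the rate is non-decreasing in $\alpha$, which is why it is naturally reported as a curve over tolerances rather than as a single number.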