Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Patrick Haller, Jonas Golde, Alan Akbik

TL;DR

This work tackles the quadratic bottleneck of Transformer self-attention by distilling a Transformer teacher into eight subquadratic student backbones. It uses a MOHAWK-inspired framework that combines cross-entropy and KL objectives with three alignment techniques (matrix mixing, QKV copying, and hidden-state alignment) to transfer the teacher's inductive biases to linearized or recurrent backbones. Subquadratic models with explicit memory dynamics (e.g., xLSTM, GLA, MetaLA) recover most of the teacher's performance. Hidden-state alignment provides the most reliable gains, while QKV copying helps as an initialization but is not sufficient on its own. The results support the viability of cross-architecture distillation for efficient language models, offer practical guidance on architectural choices and alignment strategies, and come with resources to enable broader exploration.

Abstract

Knowledge distillation is a widely used technique for compressing large language models (LLMs), in which a smaller student model is trained to mimic a larger teacher model. Typically, both the teacher and student models are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention during inference remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher model to eight subquadratic student architectures. Our study investigates which subquadratic model can most effectively approximate the teacher model's learned representations through knowledge distillation, and how different architectural design choices influence the training dynamics. We further investigate the impact of initialization strategies, such as matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.
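
As a rough illustration of the combined cross-entropy and KL objective mentioned in the TL;DR, the following minimal sketch computes the loss for a single token position in plain Python. The weighting parameter alpha, the KL direction (teacher to student), and the function names are assumptions made for illustration; the paper's actual weighting and implementation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and KL(teacher || student).

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)

    # Cross-entropy of the student against the ground-truth token.
    ce = -log_p_student[target]

    # KL divergence from the teacher distribution to the student distribution.
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return alpha * ce + (1.0 - alpha) * kl


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [1.5, 1.0, -0.5]
    print(distillation_loss(student, teacher, target=0))
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is a convenient sanity check for any implementation of this objective.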

Paper Structure

This paper contains 21 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of our knowledge distillation approach. We replace the softmax attention mechanism in transformer models with various subquadratic modules and train the resulting models using knowledge distillation and additional alignment techniques.
  • Figure 2: Long-context evaluation. Left: Perplexity over increasing context lengths. Right: LongBench scores. Models with dynamic decay terms (xLSTM, GLA, MetaLA) retain performance across increasing context lengths, while others show degradation.
  • Figure 3: Loss plots for all runs conducted in Experiment 1. Green lines indicate Stage 3 training only, while red and blue lines indicate Stages 2+3 and Stages 1+2+3, respectively.
  • Figure 4: Inference efficiency and memory consumption of linear and softmax attention models, evaluated across single sequences of varying lengths.
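
The efficiency contrast in Figure 4 and the decay terms mentioned in Figure 2 can be illustrated with a small sketch of causal linear attention. This is a simplified, unnormalized variant with a fixed scalar decay, chosen here only for illustration; models such as GLA, xLSTM, and MetaLA use learned, data-dependent gates, and none of the names below come from the paper. The recurrent form keeps one fixed-size state matrix per step, while the equivalent quadratic form revisits every earlier position.

```python
def linear_attention_recurrent(queries, keys, values, decay=1.0):
    """Causal linear attention as a recurrence with a fixed-size state.

    State update: S_t = decay * S_{t-1} + k_t v_t^T, output o_t = q_t^T S_t.
    Memory does not grow with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(queries, keys, values, decay=1.0):
    """Same computation written as an explicit sum over all earlier positions."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = decay ** (t - s) * sum(qi * ki for qi, ki in zip(q, keys[s]))
            for j, vj in enumerate(values[s]):
                out[j] += score * vj
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0], [2.0], [3.0]]
    print(linear_attention_recurrent(q, k, v, decay=0.9))
    print(linear_attention_quadratic(q, k, v, decay=0.9))
```

Both functions return the same outputs up to floating-point error. The recurrent version does constant work per token, which is the property that makes such backbones attractive for long sequences.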