Distil-xLSTM: Learning Attention Mechanisms through Recurrent Structures

Abdoul Majid O. Thiombiano, Brahim Hnich, Ali Ben Mrad, Mohamed Wiem Mkaouer

TL;DR

This work tackles the high computational cost of attention in Transformer models by teaching a recurrent xLSTM-based small language model to imitate attention dynamics through cross-architecture distillation. Distil-xLSTM reuses the teacher's embedding and classification head, incorporates a six-layer xLSTM with alternating sLSTM and mLSTM blocks, and uses Delta-distillation with time-varying $\alpha$ and $T$ to gradually shift from teacher guidance to hard labels. A Frobenius-norm regularization term further stabilizes training by aligning latent representations across architectures. Results show convergence and competitive performance on 512M-token-scale data with only ~$15\%$ trainable parameters, highlighting the practical potential of efficient, attention-approximate recurrent models for resource-constrained settings.
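The shift from teacher guidance to hard labels can be illustrated with a small, framework-free sketch. It is not the authors' implementation. The linear schedules, the start and end values for $\alpha$ and $T$, the `frob_weight` coefficient, and all function names are assumptions made for illustration. The sketch also assumes the student and teacher hidden states share the same shape, which is plausible here because the student reuses the teacher's embedding and classification head.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_schedule(step, total_steps, start, end):
    """Assumed schedule: move linearly from `start` to `end` over training."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + frac * (end - start)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the hard label."""
    return -math.log(probs[label] + eps)


def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for two equally shaped matrices (lists of rows)."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def distill_loss(student_logits, teacher_logits, label,
                 student_hidden, teacher_hidden, step, total_steps,
                 alpha_start=0.9, alpha_end=0.1,   # hypothetical values
                 temp_start=4.0, temp_end=1.0,     # hypothetical values
                 frob_weight=0.1):                 # hypothetical value
    """Combined loss for one token position at a given training step."""
    alpha = linear_schedule(step, total_steps, alpha_start, alpha_end)
    temp = linear_schedule(step, total_steps, temp_start, temp_end)

    # Soft-target term: match the teacher's temperature-softened distribution.
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    kl_term = (temp ** 2) * kl_divergence(p_teacher, p_student)

    # Hard-label term: ordinary cross-entropy at temperature 1.
    ce_term = cross_entropy(softmax(student_logits), label)

    # Alignment term: keep student hidden states close to the teacher's.
    frob_term = frobenius_distance(student_hidden, teacher_hidden)

    return alpha * kl_term + (1.0 - alpha) * ce_term + frob_weight * frob_term
```

Under these assumed schedules, early steps weight the KL term heavily at a high temperature, while late steps are dominated by cross-entropy against the hard labels at a temperature near 1.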

Abstract

The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although their computation differs from that of the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM), which shows promising results while being efficient in both compute and scale. Distil-xLSTM focuses on approximating the attention parametrization of a Transformer-based model using its recurrent sequence-mixing components, and shows good results with minimal training.

Paper Structure

This paper contains 12 sections, 14 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our distillation framework with frozen embedding layer and classification head initialized using the teacher's weights.
  • Figure 2: $\mathcal{L}_{\text{CE}}$ during training
  • Figure 3: $\mathcal{L}_{\text{KL}}$ during training
  • Figure 4: Overall loss ($\mathcal{L}_{\text{distill}}$) during training
  • Figure 5: Gradient norm during training
  • ...and 1 more figure