xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter

TL;DR

The paper introduces xLSTM 7B, a 7B-parameter recurrent LLM designed for fast and memory-efficient inference by leveraging an optimized mLSTM core with linear compute scaling. It features a post-up projection block, reduced-dimension mLSTM, fused generation kernels, and stability enhancements (RMSNorm, gate soft-capping, negative input-gate bias) to enable stable, large-scale pretraining on $2.3\text{T}$ tokens with an $8192$-token context. Empirically, xLSTM 7B achieves competitive language modeling performance compared to Transformer and Mamba baselines while delivering the highest prefill and generation throughput and lower GPU memory footprint in inference benchmarks. The authors also demonstrate strong long-context capabilities, with a long-context cooldown improving performance on very long sequences, and release the model, code, and training pipeline as open-source.
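
To make the "gate soft-capping" item above concrete, here is a minimal sketch of one common way to soft-cap a value: squashing it smoothly into a bounded range with a scaled tanh. The function name `soft_cap` and the cap value are assumptions made for illustration; the summary does not state the exact formula or constants used in xLSTM 7B.

```python
import math


def soft_cap(x: float, cap: float) -> float:
    """Smoothly squash x into the range [-cap, cap] via cap * tanh(x / cap).

    Near zero the function is close to the identity, so small pre-activations
    pass through almost unchanged; large magnitudes saturate at +/- cap.
    """
    return cap * math.tanh(x / cap)


# Hypothetical cap value, chosen only for this illustration.
CAP = 15.0

for pre_activation in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(pre_activation, soft_cap(pre_activation, CAP))
```

Bounding the gate pre-activations this way keeps extreme values from producing very large gate outputs, which is one plausible reason such a cap helps training stability.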

Abstract

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
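
The claim of linear compute scaling and constant memory usage can be illustrated with a toy recurrence. The sketch below keeps a fixed-size matrix state and updates it once per token, so the work per token and the memory held do not grow with sequence length. It is a deliberately simplified stand-in, not the mLSTM itself: the scalar gates are fixed constants here (an assumption for illustration), and the normalization and stabilization used in a real mLSTM cell are omitted. The names `step`, `read`, and `run` are hypothetical.

```python
def step(state, key, value, forget, inp):
    """One recurrent update: state <- forget * state + inp * outer(value, key)."""
    return [
        [forget * state[r][c] + inp * value[r] * key[c] for c in range(len(key))]
        for r in range(len(value))
    ]


def read(state, query):
    """Read from the memory by multiplying the state matrix with a query vector."""
    return [sum(row[c] * query[c] for c in range(len(query))) for row in state]


def run(num_tokens, dim=4):
    """Process num_tokens toy inputs; the state stays dim x dim throughout."""
    state = [[0.0] * dim for _ in range(dim)]
    for t in range(num_tokens):
        # Toy token embedding; a real model would use learned projections.
        vec = [((t + j) % 3) / 3.0 for j in range(dim)]
        # Fixed gate values are an assumption; real gates depend on the input.
        state = step(state, key=vec, value=vec, forget=0.9, inp=0.1)
    return state


short_state = run(10)
long_state = run(1000)
# Both states have the same shape regardless of how many tokens were processed.
print(len(short_state), len(short_state[0]))
print(len(long_state), len(long_state[0]))
```

By contrast, a Transformer's key-value cache grows by one entry per generated token, which is why its memory footprint increases with sequence length.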

Paper Structure

This paper contains 50 sections, 7 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Sketch of the updated xLSTM Block. The lower part is an output-gated sequence-mix layer with the mLSTM at its core, whereas the upper part is a gated MLP (SwiGLU) as a feature/channel-mix layer. See Fig. \ref{fig:mLSTMblock_detail} for details.
  • Figure 2: Loss and Gradient Norm during Pretraining of xLSTM 7B. We show the mean and maximum value over 50 steps. Our enhanced architecture and initialization enable stable pretraining of xLSTM 7B, with only two brief loss spikes early in training, from both of which training recovered rapidly.
  • Figure 3: RULER results of xLSTM 7B in comparison to Transformers (with and without long context finetuning) and State Space Models, with and without medium context cooldown.
  • Figure 4: Throughput for generating 100 tokens with batch size 1 at varying prefill lengths.
  • Figure 5: Time and GPU memory used to generate a single sequence of varying length, without prefill.
  • ...and 8 more figures