xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity

Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter

TL;DR

This work addresses how scaling laws apply to linear-time architectures like xLSTM versus quadratic-time Transformers in language modeling. It uses two fitting paradigms (Parametric and IsoFLOP) to map cross-entropy loss $L$ to model size $N$ and data $D$ under compute $C(N,D)$, examining compute-optimal and over-training regimes. The main findings are that xLSTM is Pareto-dominant in training loss for a given compute budget, compute-optimal xLSTM models tend to be larger, and xLSTM maintains constant power-law exponents in over-training; moreover, inference-speed advantages grow with context length due to linear vs quadratic scaling. Collectively, the results position xLSTM as a scalable alternative to Transformers with favorable training and inference properties, particularly for long-context tasks, and they provide a quantitative runtime model that matches empirical measurements and guides architecture choices.

Abstract

Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM's advantage widens as training and inference contexts grow.
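The claim that the advantage widens with context length can be illustrated with a toy cost model. This is a minimal sketch, not the paper's runtime model: the function names and the constants `c_fixed`, `c_attn`, and `c_state` are hypothetical. It assumes only that both architectures pay a context-independent cost per token, that attention adds a per-token cost growing with the context length, and that a linear-time model adds a constant per-token cost.

```python
def transformer_cost(context_len: int, c_fixed: float = 1000.0, c_attn: float = 1.0) -> float:
    """Toy sequence cost: per-token cost grows with context, so the total is quadratic.

    c_fixed and c_attn are illustrative constants, not measured values.
    """
    return context_len * (c_fixed + c_attn * context_len)


def linear_cost(context_len: int, c_fixed: float = 1000.0, c_state: float = 100.0) -> float:
    """Toy sequence cost for a linear-time model: constant per-token cost.

    c_fixed and c_state are illustrative constants, not measured values.
    """
    return context_len * (c_fixed + c_state)


def cost_ratio(context_len: int) -> float:
    """How many times more expensive the quadratic model is in this toy setting."""
    return transformer_cost(context_len) / linear_cost(context_len)


if __name__ == "__main__":
    for context_len in (1_000, 4_000, 16_000):
        print(context_len, round(cost_ratio(context_len), 2))
```

Under these assumed constants the ratio stays of order one at short contexts, where the context-independent term dominates, and grows roughly in proportion to the context length once the attention term takes over.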

Paper Structure

This paper contains 83 sections, 24 equations, 18 figures, 22 tables.

Figures (18)

  • Figure 1: xLSTM scaling laws: Validation loss over training compute. Left: xLSTM is Pareto-dominant over dense multi-head Transformers in terms of loss. For a fixed FLOP budget, xLSTM models reach a lower validation loss. For a fixed validation loss, xLSTM models require fewer FLOPs. Right: Parametric fit of the loss surface $L(N,D)$ as a function of model size $N$ and dataset size $D$.
  • Figure 2: Dataset of training runs for our scaling law study. The dataset contains training runs for the xLSTM and the Transformer architecture, with two configurations each: IsoFLOP and Token/Param.
  • Figure 3: Power-law fits to loss over training compute with increasing token-to-parameter (Token/Param) ratios $M$. We fit power laws of the form $\hat{L}(C) = \lambda \cdot C^{-\eta}$ and observe that, as for Transformers, the exponents $\eta$ of xLSTM remain constant even for large $M$, as indicated by the parallel lines in the log-log plot.
  • Figure 4: Varying model size and tokens with a fixed compute budget (IsoFLOP). Left: IsoFLOP profiles for varying numbers of model parameters, with a marker at the minimum $N^*$ of the fitted polynomial. Right: Power-law fit $N^*(H) = A'\cdot H^{a}$ for the compute-optimal number of model parameters. Our setup reproduces the power-law exponent $a$ for Transformers reported by Porian et al. (2024).
  • Figure 5: Left: IsoFLOP curves as a function of model parameters at 3 different context lengths. Right: Power-law fits $N^*(H)$ for the compute-optimal number of parameters as a function of the compute budget $H$. Colors indicate the compute budget and marker types indicate the model type.
  • ...and 13 more figures
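
The two fits named in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' fitting code. The helper names `fit_power_law` and `isoflop_optimum` are hypothetical, and the choice of a second-degree polynomial in $\log N$ for the IsoFLOP profile is an assumption, since the caption only says "fitted polynomial".

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss = lam * compute**(-eta) by linear least squares in log-log space.

    Returns (lam, eta). A straight line in the log-log plot has slope -eta.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def isoflop_optimum(n_params, loss):
    """Locate the minimum N* of one IsoFLOP profile.

    Assumption: the profile is fitted with a parabola in log(N); the
    degree and the log transform are illustrative choices.
    """
    a, b, _ = np.polyfit(np.log(n_params), loss, deg=2)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    return float(np.exp(-b / (2.0 * a)))


# Synthetic data only: the constants below are made up for the demonstration.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 50.0 * compute ** -0.05
lam, eta = fit_power_law(compute, loss)

n_params = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])
iso_loss = 2.5 + 0.1 * (np.log(n_params) - np.log(4e8)) ** 2
n_star = isoflop_optimum(n_params, iso_loss)

if __name__ == "__main__":
    print(f"lambda={lam:.3f}, eta={eta:.4f}, N*={n_star:.3e}")
```

The same log-log regression applies to $N^*(H) = A'\cdot H^{a}$: passing the compute budgets and the per-budget minima to `fit_power_law` returns $A'$ and $-a$, because the helper reports the negated slope.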