Scaling Laws for Linear Complexity Language Models

Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong

TL;DR

The paper investigates scaling laws for linear complexity language models, examining three architectures (TNL, HGRN2, cosFormer2) and using LLaMA as a softmax-attention baseline. By pre-training 70M–7B models on a 300B-token corpus and evaluating across commonsense reasoning (CSR), needle-in-a-haystack (NIAH) retrieval, and SCROLLS, the authors demonstrate that linear-complexity models scale comparably to traditional transformers in language proficiency and knowledge retention, though retrieval tasks reveal limitations stemming from their fixed-size hidden state. They derive power-law relationships between loss, compute, model size, and data, identifying compute-optimal allocations with exponents a≈0.64–0.71 for N_opt and b≈0.45–0.51 for D_opt, and discuss how architecture and context-length choices influence downstream performance. Overall, the work provides a framework for predictably scaling linear attention and related linear models, highlighting their potential for compute-efficient LLM development alongside persistent retrieval challenges. The findings suggest that linear-complexity models can achieve robust language understanding with favorable compute efficiency, albeit with task-dependent retrieval trade-offs.
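To make the compute-optimal relationships concrete, here is a minimal sketch (not the authors' code) of how exponents such as a and b can be recovered by fitting N_opt = k·C^a and D_opt = k·C^b via linear regression in log-log space. The compute, model-size, and token values below are placeholders chosen only to illustrate the fitting procedure, not data from the paper.

```python
# Minimal sketch: fitting compute-optimal power laws by log-log regression.
# The (compute, model size, token count) triples are hypothetical placeholders;
# in the paper they come from the minimum-loss envelope of the training curves.
import numpy as np

C = np.array([1e18, 1e19, 1e20, 1e21, 1e22])          # compute budgets (FLOPs)
N_opt = np.array([4e7, 2e8, 1e9, 5e9, 2.5e10])        # best model size per budget (params)
D_opt = np.array([3e9, 9e9, 2.7e10, 8e10, 2.5e11])    # best token count per budget

def fit_power_law(x, y):
    """Fit y = k * x^p in log-log space; returns (k, p)."""
    p, log_k = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_k), p

k_N, a = fit_power_law(C, N_opt)
k_D, b = fit_power_law(C, D_opt)
# The paper reports a≈0.64–0.71 and b≈0.45–0.51 across the four architectures.
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")
```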

Abstract

The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures: TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline softmax-attention architecture for comparison. These models were trained in six variants, ranging from 70M to 7B parameters, on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks, including validation loss, commonsense reasoning, and information retrieval and generation. The study reveals that existing linear complexity language models exhibit scaling capabilities similar to those of conventional transformer-based models while also demonstrating superior linguistic proficiency and knowledge retention.

Paper Structure

This paper contains 34 sections, 21 equations, 30 figures, and 13 tables.

Figures (30)

  • Figure 1: Training Curve Fitting for Four Architectures. In the top row, we present predicted training curves for the various architectures, with each subsequent row representing a different architecture. On the left, the training curves for models ranging from 70M to 7B parameters are displayed. From these curves, we extract the envelope of minimum loss per FLOP, using these data points to estimate the optimal model size (center) for a specified compute budget, and the optimal number of training tokens (right). A sketch of this envelope-extraction step appears after the figure list.
  • Figure 2: Comparative performance across distinct benchmarks illustrating the scaling trends observed in evaluation metrics. The figure highlights the progressive improvement in model performance as the complexity and size of the models increase, underscoring the significance of scaling in enhancing benchmark outcomes.
  • ...and 28 more figures
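As referenced in the Figure 1 caption, the following is a minimal sketch of the envelope-extraction step: given per-model training curves of loss versus cumulative FLOPs, take the best loss attainable at each compute budget and record which model size achieves it. The curve data and budget grid below are illustrative placeholders, not the paper's measurements.

```python
# Minimal sketch of the compute-loss envelope extraction described in Figure 1.
# All curve values are synthetic placeholders used only to show the procedure.
import numpy as np

# curves: model size (params) -> (flops, loss) arrays sampled along training.
curves = {
    70e6: (np.logspace(17, 19, 50), 3.8 - 0.20 * np.linspace(0, 1, 50)),
    1e9:  (np.logspace(18, 21, 50), 3.6 - 0.45 * np.linspace(0, 1, 50)),
    7e9:  (np.logspace(19, 22, 50), 3.5 - 0.60 * np.linspace(0, 1, 50)),
}

budgets = np.logspace(18, 21, 20)   # common grid of compute budgets (FLOPs)
envelope = []                        # (budget, best model size, best loss) triples
for c in budgets:
    best = None
    for n, (flops, loss) in curves.items():
        reachable = flops <= c                  # checkpoints affordable at budget c
        if reachable.any():
            l = loss[reachable].min()           # best loss this model reaches within c FLOPs
            if best is None or l < best[1]:
                best = (n, l)
    if best is not None:
        envelope.append((c, *best))

# The (budget, model size) pairs in `envelope` are the data points a power-law
# fit, like the earlier sketch, would consume to estimate N_opt(C).
```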