Scaling Laws for Linear Complexity Language Models
Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong
TL;DR
The paper investigates scaling laws for linear complexity language models, introducing three architectures (TNL, HGRN2, cosFormer2) and using LLaMA as a softmax-attention baseline. By pre-training models from 70M to 7B parameters on a 300B-token corpus and evaluating on commonsense reasoning (CSR), needle-in-a-haystack (NIAH) retrieval, and SCROLLS, the authors show that linear-complexity models scale comparably to conventional transformers in language proficiency and knowledge retention, though retrieval tasks expose limitations stemming from their fixed-size hidden states. They derive power-law relationships between loss, compute, model size, and data, identifying compute-optimal allocations with exponents a ≈ 0.64–0.71 for N_opt and b ≈ 0.45–0.51 for D_opt (sketched below), and discuss how architecture and context-length choices influence downstream performance. Overall, the work provides a framework for predictably scaling linear attention and related linear models: they achieve robust language understanding with favorable compute efficiency, though retrieval remains a persistent, task-dependent trade-off.
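For reference, the exponents above refer to the usual compute-optimal allocation form. The proportionalities below follow standard scaling-law conventions and are an assumed reading of the reported exponents rather than formulas quoted from the paper; C denotes the training compute budget (commonly approximated as C ≈ 6·N·D FLOPs), N_opt the compute-optimal parameter count, and D_opt the compute-optimal number of training tokens:

    N_opt ∝ C^a,  a ≈ 0.64–0.71        D_opt ∝ C^b,  b ≈ 0.45–0.51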
Abstract
Interest in linear complexity models for large language models is rising, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures: TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline softmax-attention architecture for comparison. Each architecture was trained in six sizes, ranging from 70M to 7B parameters, on a 300B-token corpus, and evaluated across a total of 1,376 intermediate checkpoints. Evaluation covers validation loss and a range of downstream tasks, including commonsense reasoning and information retrieval and generation. The study reveals that existing linear complexity language models exhibit scaling capabilities similar to those of conventional transformer-based models while also demonstrating superior linguistic proficiency and knowledge retention.
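As a rough illustration of what "linear complexity" and "decay" mean in this context, below is a minimal sketch of the recurrence family these architectures share. It is a simplified assumption, not the exact TNL, HGRN2, or cosFormer2 formulation (feature maps, gating parameterizations, and normalization are omitted); its purpose is to show why the per-token state is a fixed-size matrix, which also explains the retrieval limitation noted in the TL;DR.

    import numpy as np

    def linear_attention(q, k, v, decay_mode="data_independent", lam=0.95):
        # q, k: (T, d_k); v: (T, d_v). Returns per-token outputs of shape (T, d_v).
        # decay_mode:
        #   "data_independent" - fixed scalar decay lam (TNL-like; lam is an illustrative value)
        #   "data_dependent"   - per-dimension gate computed from the current key (HGRN2-like, simplified)
        #   "none"             - no decay (cosFormer2-like, ignoring its cosine reweighting)
        T, d_k = q.shape
        d_v = v.shape[1]
        S = np.zeros((d_k, d_v))              # fixed-size recurrent state (the "memory")
        outputs = np.zeros((T, d_v))
        for t in range(T):
            if decay_mode == "data_independent":
                g = lam                        # same forget factor at every step
            elif decay_mode == "data_dependent":
                g = (1.0 / (1.0 + np.exp(-k[t]))).reshape(-1, 1)  # input-dependent gate in (0, 1)
            else:
                g = 1.0                        # no forgetting
            S = g * S + np.outer(k[t], v[t])   # decay old memory, write the new key-value pair
            outputs[t] = S.T @ q[t]            # read out with the current query
        return outputs

    # Usage: 8 tokens, key/value dimension 4; cost grows as O(T), not O(T^2),
    # and the state S never grows with sequence length.
    rng = np.random.default_rng(0)
    out = linear_attention(rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                           rng.normal(size=(8, 4)), decay_mode="data_dependent")
    print(out.shape)  # (8, 4)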
