Scaling Efficient LLMs

B. N. Kausik

TL;DR

The paper questions the conventional AI scaling law by deriving a PAC-based bound under which the number of parameters in an efficient LLM scales as $D^{\gamma}$ with $\gamma\in[0.44,0.72]$, rather than linearly with the data size $D$. It then introduces recurrent transformers, which apply a single transformer layer across a fixed-width sliding window, giving linear-time sequence processing, memory efficiency, and learned accumulation or forgetting of history. Experiments on long-range image classification, copy and selective-copy tasks with curriculum training, and Shakespeare NLP indicate that recurrent transformers can match or exceed multi-layer transformers at a fraction of the compute and parameters, with favorable inference costs. These results suggest a pathway to practical, efficient LLMs that scale sublinearly with data while preserving performance, and reproducible code is available. The work pairs a theoretical framework with empirical validation across diverse tasks to support the viability of efficient architectures.
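To make the gap between linear and sublinear scaling concrete, the short sketch below compares how the parameter count would grow when the training data grows tenfold. It only evaluates the stated power law $D^{\gamma}$ at the two ends of the reported range; the function name `parameter_growth` and the tenfold data increase are illustrative assumptions, not values from the paper.

```python
def parameter_growth(data_growth: float, gamma: float) -> float:
    """Factor by which the parameter count grows when the data grows by
    `data_growth`, assuming parameters scale as D**gamma."""
    return data_growth ** gamma


# Hypothetical scenario: the training corpus becomes 10x larger.
data_growth = 10.0
for label, gamma in [("linear", 1.0), ("gamma = 0.72", 0.72), ("gamma = 0.44", 0.44)]:
    factor = parameter_growth(data_growth, gamma)
    print(f"{label:>12}: parameters grow {factor:.2f}x")
```

Under these assumptions, a tenfold increase in data calls for roughly 2.8x to 5.2x more parameters, rather than the 10x implied by linear scaling.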

Abstract

Recent LLMs have hundreds of billions of parameters and consume vast resources. Furthermore, the so-called "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e., those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law that the number of parameters in an efficient LLM scales as $D^{\gamma}$, where $D$ is the size of the training data and $\gamma \in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, which combine the efficacy of transformers with the efficiency of recurrent networks by progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in linear time in the sequence length, (b) are memory-efficient and amenable to parallel processing in large batches, (c) learn to forget history for language tasks, or accumulate history for long-range tasks like copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.
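The sliding-window idea can be illustrated with a minimal NumPy sketch. This is one possible reading of the description above, not the paper's implementation: the way state is carried between windows (shifting the processed window by one slot and appending the next token), the single-head attention, the post-layer normalization, and all names (`make_layer`, `transformer_layer`, `recurrent_transformer`) are assumptions made for illustration. The point it shows is that one shared layer is called once per position, so the work grows linearly with the sequence length for a fixed window width.

```python
import numpy as np


def make_layer(d_model, seed=0):
    """Random weights for one single-head transformer layer (illustrative only)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    names = ("Wq", "Wk", "Wv", "W1", "W2")
    return {name: rng.normal(0.0, scale, (d_model, d_model)) for name in names}


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transformer_layer(x, p):
    """Self-attention plus feed-forward on a window x of shape (width, d_model)."""
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    h = x + attn @ v                                       # residual attention
    out = h + np.maximum(h @ p["W1"], 0.0) @ p["W2"]       # residual ReLU MLP
    return layer_norm(out)


def recurrent_transformer(tokens, p, width):
    """Apply ONE shared layer to a fixed-width window slid over the sequence.

    Assumption: the processed window is carried forward as state; at each step
    the oldest slot is dropped and the next input embedding is appended.
    """
    n, d_model = tokens.shape
    window = np.zeros((width, d_model))
    outputs = np.empty((n, d_model))
    layer_calls = 0
    for t in range(n):
        window = np.vstack([window[1:], tokens[t:t + 1]])
        window = transformer_layer(window, p)
        outputs[t] = window[-1]   # output for position t
        layer_calls += 1          # O(width^2) work per step, independent of n
    return outputs, layer_calls


params = make_layer(d_model=8)
sequence = np.random.default_rng(1).normal(size=(32, 8))
outputs, calls = recurrent_transformer(sequence, params, width=4)
print(outputs.shape, calls)
```

In this sketch each step costs a fixed amount of work determined by the window width, and only the current window is kept in memory, which is consistent with properties (a) and (b) above. How history is learned to be forgotten or accumulated depends on training and is not modeled here.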

Paper Structure (12 sections, 2 theorems, 30 equations, 3 figures, 6 tables)

Introduction
Background
Empirical Scaling
Theoretical Scaling
Recurrent Transformers
Computational Complexity
Experimental Results
Long Range Image Classification
Copy and Selective Copy with Curriculum Training
Natural Language Processing
Summary
Reproducibility

Key Result

Theorem 1

Given are a sequence length $l$, a class of LLMs $F$, and a corpus $T$. Let $S$ be the set of sequences in $T$. (a) There exists a learning algorithm for $F$, such that for a given $\delta >0$, with probability $(1-\delta)$ the excess risk is $\mathcal{O} \left [ \sqrt{\frac{|S|}{|T|}} \right]$; (b

Figures (3)

Figure 1: Long Range Image Classification: 4-layer regular (left) & recurrent transformer (right)
Figure 2: Curriculum training of recurrent transformer for Selective Copy
Figure 3: Shakespeare LLM: Training loss (left) and Validation Loss (right)

Theorems & Definitions (8)

Definition 1
Definition 2
Definition 3
Definition 4
Theorem 1
proof
Lemma 1
proof
