Scaling Efficient LLMs
B. N. Kausik
TL;DR
The paper questions the conventional AI scaling law by deriving a PAC-based bound under which the number of parameters in an efficient LLM scales as $D^{\gamma}$ with $\gamma\in[0.44,0.72]$, where $D$ is the size of the training data, rather than linearly with $D$. It then introduces recurrent transformers, which apply a single transformer layer across a fixed-width sliding window, enabling linear-time sequence processing, memory efficiency, and learned accumulation or forgetting of history. Experiments on long-range image classification, copy and selective-copy tasks with curriculum training, and Shakespeare NLP indicate that recurrent transformers can match or exceed multi-layer transformers at a fraction of the compute and parameters, with favorable inference costs. These results suggest a pathway to practical, efficient LLMs that scale sublinearly with data while preserving performance, with reproducible code available. The work combines a theoretical framework with empirical validation across diverse tasks to support the viability of efficient architectures.
Abstract
Recent LLMs have hundreds of billions of parameters and consume vast resources. Furthermore, the so-called "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e., those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law that the number of parameters in an efficient LLM scales as $D^{\gamma}$, where $D$ is the size of the training data and $\gamma \in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, which combine the efficacy of transformers with the efficiency of recurrent networks by progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in time linear in the sequence length, (b) are memory-efficient and amenable to parallel processing in large batches, (c) learn to forget history for language tasks, or to accumulate history for long-range tasks like copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.
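To make the gap between the linear rule and the $D^{\gamma}$ rule concrete, the following sketch compares how the parameter count would grow when the training data grows tenfold. It is illustrative arithmetic only: the helper `param_growth` is a hypothetical name, and because the proportionality constant is not given here, only growth ratios are computed.

```python
# Illustrative arithmetic only: compare N proportional to D (linear)
# with N proportional to D**gamma. The constant of proportionality is
# unknown here, so only growth ratios are reported.

def param_growth(data_growth: float, gamma: float) -> float:
    """Factor by which parameters grow when data grows by `data_growth`,
    assuming the parameter count is proportional to D**gamma."""
    return data_growth ** gamma


if __name__ == "__main__":
    # gamma = 1.0 is the linear rule, included for comparison.
    for gamma in (0.44, 0.72, 1.0):
        factor = param_growth(10.0, gamma)
        print(f"gamma={gamma:.2f}: 10x data -> {factor:.2f}x parameters")
```

Under these assumptions, a tenfold increase in data calls for roughly 2.75 to 5.25 times as many parameters across the stated range of $\gamma$, compared with 10 times under the linear rule.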

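The sliding-window idea described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `sliding_window_apply` and `mean_mix_layer` are hypothetical, the stand-in layer is a simple averaging function rather than a real transformer layer, and writing the layer output back in place is one assumed way for history to propagate between windows.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def sliding_window_apply(inputs: Sequence[Vector], layer: Layer, width: int) -> List[Vector]:
    """Apply one shared `layer` to each window of `width` slots, left to right.

    Assumption (not taken from the paper): the layer output overwrites the
    window in place, so later windows see a processed summary of earlier
    positions.
    """
    if width < 1 or width > len(inputs):
        raise ValueError("width must be between 1 and len(inputs)")
    seq = [list(v) for v in inputs]  # work on a copy of the input
    # One layer call per window position: len(seq) - width + 1 calls in total,
    # so the cost is linear in sequence length for a fixed width.
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        out = layer(window)
        if len(out) != width:
            raise ValueError("layer must return one vector per window slot")
        seq[start:start + width] = out
    return seq


def mean_mix_layer(window: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a transformer layer: each slot becomes the
    average of itself and the window mean (a crude uniform attention)."""
    dim = len(window[0])
    mean = [sum(v[i] for v in window) / len(window) for i in range(dim)]
    return [[0.5 * v[i] + 0.5 * mean[i] for i in range(dim)] for v in window]


if __name__ == "__main__":
    xs = [[float(i)] for i in range(6)]
    print(sliding_window_apply(xs, mean_mix_layer, width=3))
```

Because the same layer is reused at every position, the parameter count in this sketch does not depend on sequence length, and each call touches only `width` slots, which is what keeps memory use bounded.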