Language Models "Grok" to Copy
Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
TL;DR
This work investigates how context copying emerges during the pre-training of Transformer LLMs and argues that it develops through a grokking-like transition. In experiments with 12-layer LLaMA models trained on $40\text{B}$ tokens, the authors show that copying accuracy surges only after the training loss has stabilized (around $15\text{B}$ tokens), that the pace at which copying develops depends on the number of update steps rather than the total number of tokens, and that induction heads form from shallow to deep layers as training progresses. They quantify induction-head behavior with the metrics $I^{(L,H)}$ and $EP^{(L,H)}$, which reveal deeper-layer circuits associated with copying, and demonstrate that regularization (e.g., dropout, weight decay) can accelerate or enhance grokked copying. The authors argue that this grokking-to-copy perspective can guide more efficient training strategies that strengthen in-context learning and retrieval-augmented generation, potentially improving downstream performance while reducing data and compute requirements.
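To make the idea of scoring a head for induction behavior concrete, here is a minimal sketch of one common prefix-matching style score. It is not the paper's $I^{(L,H)}$ or $EP^{(L,H)}$, whose exact definitions are not given in this summary. It assumes a single head's causal attention matrix computed on a sequence of `period` random tokens repeated twice; the function names and the two synthetic attention patterns are hypothetical and used only for illustration.

```python
def induction_score(attn, period):
    """Mean attention that queries in the second copy place on the token
    right after their earlier occurrence (assumed input: `period` random
    tokens repeated twice, so the sequence length is 2 * period)."""
    seq_len = len(attn)
    if seq_len != 2 * period or any(len(row) != seq_len for row in attn):
        raise ValueError("attn must be a square matrix of size 2 * period")
    total = 0.0
    for q in range(period, seq_len):
        # The token at q first appeared at q - period; the token that
        # followed it sits at q - period + 1, which is what should be copied.
        total += attn[q][q - period + 1]
    return total / period


def uniform_causal_attention(seq_len):
    """Baseline head: each query spreads attention evenly over its past."""
    return [[1.0 / (q + 1) if k <= q else 0.0 for k in range(seq_len)]
            for q in range(seq_len)]


def ideal_induction_attention(period):
    """Idealized induction head: second-copy queries attend only to the
    token after their earlier occurrence; first-copy queries attend to self."""
    seq_len = 2 * period
    attn = [[0.0] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        target = q - period + 1 if q >= period else q
        attn[q][target] = 1.0
    return attn


if __name__ == "__main__":
    print(induction_score(ideal_induction_attention(4), 4))
    print(induction_score(uniform_causal_attention(8), 4))
```

Under these assumptions, computing such a score for every head at successive checkpoints is one way to observe whether high-scoring heads appear in shallow layers before deep ones.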
Abstract
We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context, a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective: Transformer-based language models develop copying abilities in a manner similar to grokking, which refers to sudden generalization on the test set long after the model has fit the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, just as grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrate that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.
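The copying ability described above can be illustrated with a toy evaluation. This is a sketch under stated assumptions, not the paper's protocol, which is not specified in this section. Here a prompt is built from several random-token sequences followed by the first few tokens of one of them, and a prediction counts as correct if it reproduces the rest of that sequence. The function names, sizes, and the `oracle_copy` stand-in for a model are all hypothetical.

```python
import random


def make_copy_sample(rng, num_seqs=5, seq_len=12, prefix_len=4, vocab_size=32000):
    """Build (prompt, target): random sequences, then the prefix of one of them.
    The target is the remainder of the chosen sequence."""
    seqs = [[rng.randrange(vocab_size) for _ in range(seq_len)]
            for _ in range(num_seqs)]
    chosen = seqs[rng.randrange(num_seqs)]
    prompt = [tok for seq in seqs for tok in seq] + chosen[:prefix_len]
    return prompt, chosen[prefix_len:]


def oracle_copy(prompt, prefix_len, num_new_tokens):
    """Idealized copier standing in for a model: find the earliest earlier
    occurrence of the trailing prefix and return the tokens that followed it."""
    prefix = prompt[-prefix_len:]
    last_start = len(prompt) - prefix_len  # where the trailing prefix begins
    for start in range(last_start):
        if prompt[start:start + prefix_len] == prefix:
            follow = start + prefix_len
            return prompt[follow:follow + num_new_tokens]
    return []  # no earlier occurrence found


def copy_accuracy(predict, samples, prefix_len):
    """Fraction of samples whose predicted continuation equals the target."""
    hits = 0
    for prompt, target in samples:
        if predict(prompt, prefix_len, len(target)) == target:
            hits += 1
    return hits / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [make_copy_sample(rng) for _ in range(20)]
    print(copy_accuracy(oracle_copy, samples, prefix_len=4))
```

Replacing `oracle_copy` with greedy decoding from a model checkpoint, and repeating the measurement across checkpoints, would give a copying-accuracy curve that can be compared against the pre-training loss curve.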
