Language Models "Grok" to Copy

Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan

TL;DR

This work investigates how context copying emerges during pre-training of Transformer LLMs and proposes a grokking-like transition as the underlying mechanism. Through empirical study of 12-layer LLaMA models trained on $40\text{B}$ tokens, the authors show that copying accuracy surges after training loss stabilizes (around $15\text{B}$ tokens), that the development of copying is governed by update steps rather than total tokens, and that induction heads form from shallow to deep layers as training progresses. They quantify induction-head behavior with $I^{(L,H)}$ and $EP^{(L,H)}$, revealing deeper-layer circuits associated with copying, and demonstrate that regularization (e.g., dropout, weight decay) can accelerate or enhance grokked copying. The work argues that this grokking-to-copy perspective can guide more efficient training strategies to bolster in-context learning and retrieval-augmented generation, with practical implications for improving downstream performance while potentially reducing data and compute requirements. Overall, the paper highlights a coherent link between grokking phenomena and context copying, offering a framework for designing training regimens that improve in-context abilities.

Abstract

We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context, a fundamental skill for various LLM applications, including in-context learning (ICL) and retrieval-augmented generation (RAG). We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking, which refers to sudden generalization on the test set long after the model has fit the training set. Our experiments yield three arguments: (1) The pre-training loss decreases rapidly, while the context copying ability of models initially lags and then abruptly saturates. (2) The speed of developing copying ability is independent of the number of tokens trained, similar to how grokking speed is unaffected by dataset size as long as the data distribution is preserved. (3) Induction heads, the attention heads responsible for copying, form from shallow to deep layers during training, mirroring the development of circuits in deeper layers during grokking. We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training, ultimately improving in-context performance. For example, we demonstrate that techniques that enhance grokking, such as regularization, either accelerate or enhance the development of context copying.

Paper Structure (15 sections, 1 equation, 7 figures)

Figures (7)

  • Figure 1: A test input example when $i=1$. The correct completion of this input should be ba717e.
  • Figure 2: The bars show the average context copying accuracy, and the line shows the pre-training loss. The X-axis represents the number of tokens trained. Copying is clearly grokked at around 15B tokens.
  • Figure 3: We manage the token count trained at specific steps by adjusting the batch size. Three models trained with different batch sizes develop fundamental copying abilities after around 38,000 update steps, despite training on varying numbers of tokens.
  • Figure 4: With a fixed learning rate, the convergence rate on the training set, as indicated by the training loss, is related to the token count. However, under similar convergence rates, the copying capacity varies significantly, which is influenced by the number of update steps.
  • Figure 5: With a fixed batch size (64), a larger learning rate accelerates the grokking to copy.
  • ...and 2 more figures
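
The relationship behind Figures 3 and 4 (the same number of update steps corresponding to different token counts when the batch size changes) can be made concrete with a small arithmetic sketch. The sequence length and the batch sizes below are hypothetical values chosen for illustration only; this page does not state the settings used in the paper. Only the figure of roughly 38,000 update steps comes from the Figure 3 caption.

```python
def tokens_seen(update_steps: int, batch_size: int, seq_len: int) -> int:
    """Tokens consumed after `update_steps` optimizer updates.

    Assumes every update processes `batch_size` sequences of exactly
    `seq_len` tokens (no padding, no gradient accumulation).
    """
    return update_steps * batch_size * seq_len


# Hypothetical settings, not taken from the paper.
SEQ_LEN = 1024
BATCH_SIZES = (32, 64, 128)

# Approximate step count at which copying emerges, per the Figure 3 caption.
STEPS_AT_EMERGENCE = 38_000

for batch_size in BATCH_SIZES:
    tokens = tokens_seen(STEPS_AT_EMERGENCE, batch_size, SEQ_LEN)
    print(f"batch_size={batch_size:>3}: {tokens / 1e9:.2f}B tokens at step {STEPS_AT_EMERGENCE}")
```

Under these assumptions, doubling the batch size doubles the tokens consumed by a given step, so models that reach basic copying ability at a similar step count will have seen very different amounts of data, which is the distinction the captions draw between update steps and token count.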