Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition

Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR

This work presents a unified, circuit-level framework that explains grokking, double descent, and emergent abilities through the competition between memorization and generalization circuits, as a function of model size and training data quantity. It introduces a critical dataset size $D_{\text{crit}}^M$ and a memorization capacity $D_{\text{mem}}^M$ to delineate four training dynamics, and uses modular addition tasks to illustrate grokking and memorization. The authors validate two predictions about double descent with controlled experiments. They also show that, in multi-task learning, mixing a memorization task with an algorithmic task pushes the emergence of the algorithmic capability to much larger models, offering a fresh perspective on emergent abilities in Large Language Models. The framework connects the prior grokking and double-descent literature with emergent behaviors, highlighting the role of task complexity and circuit efficiency in shaping generalization during training.

Abstract

Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each depending on varying combinations of model size and training data quantity. Utilizing this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we expand our framework to the multi-task learning paradigm, demonstrating how algorithm tasks can be turned into emergent abilities. This offers a novel perspective to understand emergent abilities in Large Language Models.
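
The TL;DR mentions modular addition as the illustrative task, and the framework varies the training data quantity. As a rough illustration of what such a dataset could look like, the sketch below enumerates all pairs `(a, b)`, labels each with `(a + b) mod p`, and holds out everything beyond a fixed number of training examples for validation. The modulus `113`, the function name, and the seed are assumptions made for this sketch and are not taken from the text above; only the training size of `3000` echoes the figure captions.

```python
import random


def make_modular_addition_split(modulus, train_size, seed=0):
    """Build ((a, b), (a + b) % modulus) examples and split them into train/validation."""
    # Enumerate every ordered pair once, so the full dataset has modulus**2 examples.
    examples = [
        ((a, b), (a + b) % modulus)
        for a in range(modulus)
        for b in range(modulus)
    ]
    if not 0 < train_size < len(examples):
        raise ValueError("train_size must be between 1 and modulus**2 - 1")
    # Shuffle with a local, seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)
    return examples[:train_size], examples[train_size:]


if __name__ == "__main__":
    # Hypothetical setting: modulus 113 (assumed), 3000 training examples.
    train, val = make_modular_addition_split(modulus=113, train_size=3000)
    print(len(train), len(val))  # prints "3000 9769"
```

Sweeping `train_size` while holding the model fixed is one simple way to probe how the training dynamic changes with data quantity; the actual experimental setup may differ from this sketch.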

Paper Structure (29 sections, 2 equations, 10 figures)

Figures (10)

  • Figure 1: Memorization capacity increases and critical dataset size decreases as models grow, which splits the figure into four distinct zones: progression, ungrokking, grokking, and semi-grokking. Each zone exhibits a specific training dynamic.
  • Figure 2: The four distinct training dynamics corresponding to the zones identified in Figure 1 and discussed in the proposed-framework section. Each panel shows one dynamic: (a) Progression, with a model of hidden size $8$ trained on $3000$ data points. (b) Ungrokking, with a model of hidden size $32$ trained on $2600$ data points. (c) Grokking, with a larger model of hidden size $64$, also trained on $3000$ data points. (d) Semi-grokking, with a model of hidden size $32$ trained on $3000$ data points. Together, these panels show how models with different configurations respond to specific training data volumes.
  • Figure 3: Final validation accuracy across various training dataset sizes and model hidden sizes. Larger models are represented in green, while smaller models are in blue. This figure demonstrates that models with larger hidden sizes attain near-perfect validation accuracy with comparatively less training data, indicating a reduced critical dataset size for these models.
  • Figure 4: Graphical representation of the model's memorization capacity relative to its size. Each model is run using three distinct random seeds and the average performance is depicted. The light blue shaded region illustrates the 95% confidence interval, which is notably narrow, highlighting the consistency of the memorization capacity across different model sizes.
  • Figure 5: Progression experiments with $5$ different random seeds. All experiments are conducted with a training dataset size of $3000$ and a model hidden size of $8$. Curves with obvious bumps in training accuracy show progression during training.
  • ...and 5 more figures
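
The Figure 4 caption describes averaging over three random seeds and shading a 95% confidence interval. The captions do not say how that interval was computed, so the sketch below uses one common choice, a Student's t interval on the mean, purely as an illustration. The function name and the sample values are made up for this example and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for small degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def mean_and_ci95(values):
    """Return (mean, lower, upper) of a 95% t-interval for a handful of seed runs."""
    n = len(values)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only covers 2 to 5 runs")
    mean = statistics.fmean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = T_CRIT_95[df] * sem
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Made-up memorization-capacity measurements from three hypothetical seeds.
    capacities = [1010.0, 1000.0, 990.0]
    mean, lower, upper = mean_and_ci95(capacities)
    print(f"mean={mean:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

With only three runs the t critical value is large (about 4.3), so a narrow band in such a plot indicates very small variation across seeds.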