Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
TL;DR
This work presents a unified, circuit-level framework that explains grokking, double descent, and emergent abilities as outcomes of the competition between memorization and generalization circuits, with the outcome determined by model size and training data quantity. It introduces two quantities, D_crit^M and D_mem^M, to delineate four training dynamics, uses modular addition tasks to illustrate grokking and memorization, and demonstrates how multi-task learning can yield emergent abilities. The authors validate two predictions about double descent in controlled experiments and show that mixing memorization tasks with algorithmic tasks delays the emergence of the algorithmic capability until models are much larger, offering a new perspective on emergent abilities in Large Language Models. The framework connects the prior grokking and double-descent literature with emergent behaviors, highlighting the role of task complexity and circuit efficiency in shaping generalization during training.
Abstract
Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models, which challenge human intuition and are crucial for a deeper understanding of neural models. In this paper, we present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits. This approach, initially employed to explain grokking, is extended in our work to encompass a wider range of model sizes and training data volumes. Our framework delineates four distinct training dynamics, each arising from a different combination of model size and training data quantity. Using this framework, we provide a detailed analysis of the double descent phenomenon and propose two verifiable predictions regarding its occurrence, both substantiated by our experimental results. Moreover, we extend our framework to the multi-task learning paradigm, demonstrating how algorithmic tasks can be turned into emergent abilities. This offers a novel perspective for understanding emergent abilities in Large Language Models.
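
For readers unfamiliar with the modular addition task mentioned in the TL;DR, the following is a minimal sketch of how such a dataset is typically built: every pair (a, b) is labeled with (a + b) mod p, and a random fraction of the pairs is held in for training. The function name, the modulus p = 113, the training fraction, and the seed are illustrative assumptions, not values taken from the paper.

```python
import random


def modular_addition_dataset(p=113, train_fraction=0.3, seed=0):
    """Build all (a, b, (a + b) % p) examples and split them at random.

    p, train_fraction, and seed are hypothetical defaults chosen for
    illustration; the paper's actual settings may differ.
    """
    # Enumerate the full input space: p * p pairs, each with its label.
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    # The size of the training split plays the role of "data quantity".
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = modular_addition_dataset()
```

Varying `train_fraction` here corresponds to varying the training data quantity in the framework, while model size would be varied separately in the network trained on these examples.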
