Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks
Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo
TL;DR
The paper links grokking to the lottery ticket hypothesis to study how internal subnetworks shape delayed generalization. It shows that grokked tickets (subnetworks identified during the generalizing phase) significantly shorten the time to generalization on modular arithmetic, polynomial regression, sparse parity, and MNIST, and that this effect cannot be explained by weight norms or sparsity alone. Controlled experiments indicate that good structure, rather than mere sparsity or norm reduction, drives the faster transition from memorization to generalization, and that pruning methods such as edge-popup can find these beneficial structures without changing the weights. The work identifies structural exploration as a key mechanism in grokking, connecting network topology, periodic representations, and graph properties to improved generalization, and it suggests practical structure-aware strategies for accelerating learning in neural networks. Overall, the findings offer a structural perspective on regularization and point toward architecture- and pruning-driven approaches to mitigating grokking across tasks and models.
Abstract
Grokking is an intriguing phenomenon of delayed generalization, where neural networks initially memorize training data with perfect accuracy but exhibit poor generalization, subsequently transitioning to a generalizing solution with continued training. While factors such as weight norms and sparsity have been proposed to explain this delayed generalization, the influence of network structure remains underexplored. In this work, we link the grokking phenomenon to the lottery ticket hypothesis to investigate the impact of internal network structures. We demonstrate that utilizing lottery tickets obtained during the generalizing phase (termed grokked tickets) significantly reduces delayed generalization across various tasks, including multiple modular arithmetic operations, polynomial regression, sparse parity, and MNIST classification. Through controlled experiments, we show that the mitigation of delayed generalization is not due solely to reduced weight norms or increased sparsity, but rather to the discovery of good subnetworks. Furthermore, we find that grokked tickets exhibit periodic weight patterns and beneficial graph properties, such as increased average path lengths and reduced clustering coefficients, and that they undergo rapid structural changes that coincide with improvements in generalization. Additionally, pruning techniques like the edge-popup algorithm can identify these effective structures without modifying the weights, thereby transforming memorizing networks into generalizing ones. These results underscore the novel insight that structural exploration plays a pivotal role in understanding grokking. The implementation is available at https://github.com/gouki510/Grokking-Tickets.
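To make the idea of a grokked ticket concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual lottery-ticket recipe: build a binary mask by magnitude pruning of weights taken from the generalizing phase, then apply that mask to the initial weights before retraining. The abstract does not state the pruning criterion, so magnitude pruning, the function names `magnitude_mask` and `apply_mask`, the 4x4 layer size, and the random stand-in weights are all illustrative assumptions.

```python
import random


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask keeping the largest-magnitude entries.

    `sparsity` is the fraction of weights to remove (assumed criterion:
    global magnitude pruning within a single weight matrix).
    """
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    keep = max(1, round(len(flat) * (1.0 - sparsity)))
    threshold = flat[keep - 1]
    # Entries tied with the threshold are all kept.
    return [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]


def apply_mask(weights, mask):
    """Zero out pruned connections; surviving weights are unchanged."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


random.seed(0)
# Hypothetical initial weights of one 4x4 layer.
init_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]
# Stand-in for the same layer after training has reached the generalizing
# phase; in practice these would come from a trained checkpoint.
grokked_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]

# The mask (the "ticket" structure) comes from the generalizing network...
mask = magnitude_mask(grokked_weights, sparsity=0.5)
# ...and is applied to the initial weights, which are then retrained.
ticket = apply_mask(init_weights, mask)
```

In this sketch only the mask carries information from the generalizing network; the weight values themselves are reset to initialization, which is what separates a structural effect from an effect of the learned weights.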
