Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks

Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

The paper links grokking to the lottery ticket hypothesis and investigates how internal subnetworks influence delayed generalization. It demonstrates that grokked tickets (subnetworks identified during the generalizing phase) significantly reduce the time to generalization across modular arithmetic tasks, polynomial regression, sparse parity, and MNIST, and that this effect cannot be explained by weight norms or sparsity alone. Through controlled experiments, the authors show that good structure, rather than mere sparsity or norm reduction, underlies the faster transition from memorization to generalization; they also show that pruning mechanisms like edge-popup can identify these beneficial structures without changing weights. The work highlights structural exploration as a key mechanism in grokking, linking network topology, periodic representations, and graph properties to improved generalization, and suggests practical structure-aware strategies for accelerating learning in neural networks. Overall, the findings offer a structural perspective on regularization and point toward architecture- and pruning-driven approaches to mitigating grokking in diverse tasks and models.

Abstract

Grokking is an intriguing phenomenon of delayed generalization, where neural networks initially memorize training data with perfect accuracy but exhibit poor generalization, subsequently transitioning to a generalizing solution with continued training. While factors such as weight norms and sparsity have been proposed to explain this delayed generalization, the influence of network structure remains underexplored. In this work, we link the grokking phenomenon to the lottery ticket hypothesis to investigate the impact of internal network structures. We demonstrate that utilizing lottery tickets obtained during the generalizing phase (termed grokked tickets) significantly reduces delayed generalization across various tasks, including multiple modular arithmetic operations, polynomial regression, sparse parity, and MNIST classification. Through controlled experiments, we show that the mitigation of delayed generalization is not due solely to reduced weight norms or increased sparsity, but rather to the discovery of good subnetworks. Furthermore, we find that grokked tickets exhibit periodic weight patterns, beneficial graph properties such as increased average path lengths and reduced clustering coefficients, and undergo rapid structural changes that coincide with improvements in generalization. Additionally, pruning techniques like the edge-popup algorithm can identify these effective structures without modifying the weights, thereby transforming memorizing networks into generalizing ones. These results underscore the novel insight that structural exploration plays a pivotal role in understanding grokking. The implementation code can be accessed via this link: https://github.com/gouki510/Grokking-Tickets.
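To make the notion of a ticket concrete, the following is a minimal sketch of how a lottery ticket could be extracted from a trained layer. It assumes the standard lottery-ticket recipe (global magnitude pruning of the trained weights, with the resulting mask applied to the initial weights); the abstract does not spell out the procedure, so the function names `magnitude_mask` and `make_ticket` and the single-layer setting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def magnitude_mask(weights: np.ndarray, pruning_rate: float) -> np.ndarray:
    """Boolean mask that drops the smallest-magnitude fraction of entries.

    `pruning_rate` is the fraction of weights to remove (0.0 keeps all).
    """
    if not 0.0 <= pruning_rate < 1.0:
        raise ValueError("pruning_rate must be in [0, 1)")
    n_prune = int(round(pruning_rate * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        flat = np.abs(weights).ravel()
        # argpartition places the n_prune smallest magnitudes first.
        prune_idx = np.argpartition(flat, n_prune - 1)[:n_prune]
        mask[prune_idx] = False
    return mask.reshape(weights.shape)


def make_ticket(init_weights: np.ndarray,
                trained_weights: np.ndarray,
                pruning_rate: float):
    """Hypothetical helper: mask from trained weights, values from init.

    If `trained_weights` is taken from the generalizing phase, the result
    corresponds to what the abstract calls a grokked ticket (assumption).
    """
    mask = magnitude_mask(trained_weights, pruning_rate)
    return init_weights * mask, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_init = rng.normal(size=(4, 5))
    w_trained = rng.normal(size=(4, 5))
    ticket, mask = make_ticket(w_init, w_trained, pruning_rate=0.6)
    print("kept weights:", int(mask.sum()), "of", mask.size)
```

The mask carries only structure: the surviving weights are reset to their initial values, so any change in training dynamics comes from which connections remain rather than from what the trained network had already learned.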

Paper Structure (40 sections, 22 equations, 20 figures, 1 table)

Figures (20)

  • Figure 1: (Left) Accuracy of the dense model and of the lottery ticket obtained at the generalizing solution (grokked ticket). With a lottery ticket (a good subnetwork), train and test accuracy rise almost together, i.e., the gap between memorization ($t_\text{mem}$) and generalization ($t_\text{gen}$) shrinks significantly. Note that not only the difference ($t_\text{gen} - t_\text{mem}$) but also the ratio ($t_\text{gen} / t_\text{mem}$) improves significantly, so the effect is not just a matter of faster learning. (Right) Three hypotheses for why a lottery ticket reduces delayed generalization. We show that the cause is not a reduction in weight norm or an increase in sparsity, but rather the discovery of good structure.
  • Figure 2: Comparison of the grokking speed of dense networks and grokked tickets across setups: (a) modular addition with an MLP, (b) modular addition with a Transformer, (c) other modular arithmetic tasks (distinguished by color), and, beyond modular arithmetic, (d) loss on polynomial regression and (e) accuracy on sparse parity. The dashed line represents the base model, and the solid line represents the grokked tickets. In all setups, grokked tickets reduce the time to generalization ($t_\text{gen}$).
  • Figure 3: Quantitative comparison of grokking speed across pruning rates. A pruning rate of 0.0 corresponds to the dense network. $\tau_\text{grok}$ is defined in the experimental setup subsection.
  • Figure 4: (a) Test accuracy of lottery tickets acquired at different epochs $t$, sampled every 2k epochs. Lottery tickets obtained before 25k epochs (non-grokked tickets) do not fully generalize, and their generalization ability corresponds to the test accuracy of the base model. Lottery tickets obtained after 25k epochs (grokked tickets) reduce delayed generalization. (b) Effect of the pruning rate $k$ on grokked tickets, varied in steps of 0.2. Most pruning rates (0.1, 0.3, 0.5, and 0.7) accelerate generalization, indicating that the observation above does not depend heavily on the choice of pruning rate.
  • Figure 5: (a) Test accuracy dynamics of the base model, the grokked ticket, and controlled dense models (matched in L1 norm and L2 norm). The grokked ticket reaches generalization much faster than the other models. (b) Test accuracy of different pruning methods. All PaI (pruning-at-initialization) methods perform worse than the base model or, in some cases, worse than random pruning. These results indicate that neither weight norm nor sparsity alone is the cause of delayed generalization.
  • ...and 15 more figures
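
The difference and ratio discussed in the Figure 1 caption can be computed from logged accuracy curves. The sketch below is illustrative only: it assumes $t_\text{mem}$ and $t_\text{gen}$ are the first epochs at which train and test accuracy, respectively, reach a fixed threshold. The 0.95 threshold and the function names are assumptions, and the paper's own definition of $\tau_\text{grok}$ (given in its experimental setup) may differ.

```python
from typing import Dict, Optional, Sequence


def first_epoch_above(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first epoch whose accuracy reaches `threshold`, else None."""
    for epoch, value in enumerate(acc):
        if value >= threshold:
            return epoch
    return None


def grokking_delay(train_acc: Sequence[float],
                   test_acc: Sequence[float],
                   threshold: float = 0.95) -> Optional[Dict[str, float]]:
    """Return t_mem, t_gen, their difference, and their ratio.

    The threshold is an assumed convention, not taken from the paper.
    Returns None if either curve never reaches the threshold.
    """
    t_mem = first_epoch_above(train_acc, threshold)
    t_gen = first_epoch_above(test_acc, threshold)
    if t_mem is None or t_gen is None:
        return None
    # Guard against division by zero when memorization happens at epoch 0.
    ratio = t_gen / t_mem if t_mem > 0 else float("inf")
    return {
        "t_mem": t_mem,
        "t_gen": t_gen,
        "difference": t_gen - t_mem,
        "ratio": ratio,
    }


if __name__ == "__main__":
    train = [0.10, 0.50, 0.96, 1.00, 1.00, 1.00]
    test = [0.10, 0.10, 0.20, 0.30, 0.97, 1.00]
    print(grokking_delay(train, test))
```

Reporting the ratio alongside the difference separates a uniformly faster learner, for which both $t_\text{mem}$ and $t_\text{gen}$ shrink proportionally and the ratio stays fixed, from one in which the gap between memorization and generalization itself closes.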