Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity

Jack Miller, Charles O'Neill, Thang Bui

TL;DR

This work broadens the grokking phenomenon beyond neural networks by showing its presence in Gaussian processes, linear models, and Bayesian neural networks, and introduces a model-agnostic mechanism tied to the balance between error and complexity. A novel concealment data-augmentation strategy demonstrates how grokking can be induced in algorithmic tasks, and parameter-space analyses reveal how grokking maps onto transitions between high- and low-complexity solution regions. The authors connect prior loss-, representation-, and NTK-based theories under a unified, parsimony-driven framework and argue that grokking is fundamentally model-agnostic, arising whenever solution search is guided by both error and complexity. The findings have implications for understanding generalisation dynamics in a wide range of models and datasets, with practical relevance for mitigating or leveraging late generalisation in real-world applications.

Abstract

In some settings, neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression, linear regression and Bayesian neural networks. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures shows that grokking is not restricted to settings considered in current theoretical and empirical studies. Instead, grokking may be possible in any model where solution search is guided by complexity and error.
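
The idea of adding dimensions that contain spurious information can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `make_parity_data` and `conceal` are hypothetical, and it assumes that the spurious dimensions are uniformly random bits appended to each input of a parity task.

```python
import random


def make_parity_data(n_samples, n_bits, rng):
    """Build binary inputs labelled by the parity (XOR) of their bits."""
    xs = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_samples)]
    ys = [sum(x) % 2 for x in xs]
    return xs, ys


def conceal(xs, n_extra, rng):
    """Append n_extra random bits to every input.

    Assumption: the extra bits are drawn independently of the label,
    so they carry no information about it.
    """
    return [x + [rng.randint(0, 1) for _ in range(n_extra)] for x in xs]


rng = random.Random(0)
xs, ys = make_parity_data(n_samples=8, n_bits=3, rng=rng)
xs_concealed = conceal(xs, n_extra=10, rng=rng)
```

The labels `ys` are unchanged by `conceal`; only the input width grows, here from 3 to 13 bits. Sweeping `n_extra` over values such as 10, 20, 30 and 40 would give one dataset per setting on which to measure how long validation accuracy lags behind training accuracy.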

Paper Structure (51 sections, 2 theorems, 32 equations, 22 figures, 2 tables)

Key Result

Corollary 1

The phenomenon of grokking should be model agnostic. Namely, it could occur in any setting in which solution search is guided by complexity and error.
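
One way to read "guided by complexity and error" is as a two-term objective. The expression below is an illustrative assumption rather than a formula taken from the paper: $\theta$ denotes the model parameters, $E$ a data-fit error term, $C$ a complexity penalty, and $\lambda$ a weighting between the two.

```latex
\mathcal{L}(\theta) = E(\theta) + \lambda\, C(\theta), \qquad \lambda > 0
```

For instance, in linear regression with weight decay, $E$ could be the mean squared error and $C$ the squared norm of the weights. Under this reading, a search could first reach a low-error, high-complexity solution and only later move toward a lower-complexity one, which is consistent with the summary's description of transitions between high- and low-complexity solution regions.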

Figures (22)

  • Figure 1: Accuracy, data fit and complexity on zero-one slope classification task with a linear model. Note that the shaded region corresponds to the standard error of five training runs. Further, the grey line marks the point of minimum data fit.
  • Figure 2: Accuracy and log likelihoods on zero-one classification task with an RBF Gaussian process. Note that the shaded region corresponds to the standard error of five training runs.
  • Figure 3: Accuracy and log likelihoods on hidden parity prediction task with an RBF Gaussian process. Note that the shaded region corresponds to the standard error of five training runs. Acc. is Accuracy and Val. is Validation.
  • Figure 4: Relationship between grokking gap and number of additional dimensions using the grokking via concealment strategy. Note that $x$-values are artificially perturbed to make the error bars easier to see. In reality they are either $10$, $20$, $30$ or $40$. Also, the data for zero additional length is removed (although it still influences the regression fit). See Appendix \ref{appendix:original-concealment-plot} for the plot without these changes.
  • Figure 5: Trajectories through parameter landscape for GP regression. Initialisation points A-C refer to those mentioned in Section \ref{sec:gp-on-sinusoidal-example}.
  • ...and 17 more figures

Theorems & Definitions (3)

  • Corollary 1
  • Definition B.1: Kolmogorov
  • Theorem 1