Explaining Grokking in Transformers through the Lens of Inductive Bias
Jaisidh Singh, Diganta Misra, Antonio Orvieto
TL;DR
The paper investigates grokking in transformers through the lens of inductive bias, focusing on how architectural choices (notably Layer Normalization position) and optimization settings shape the rate and nature of generalization. It introduces a one-layer transformer trained on modular addition and analyzes how LN placement drives distinct biases, including shortcut learning, attention entropy, and the emergence of Fourier-like, periodic solutions. It further shows that optimization factors such as learning rate, weight decay, and readout scale interact with optimizer behavior (e.g., AdamW), sometimes confounding lazy-to-rich interpretations and revealing that feature evolution is continuous and compressibility-driven. Across LN configurations and optimization modulators, the results link inductive bias, feature compressibility, and generalization, suggesting that grokking in transformers is governed by continuous feature evolution under specific architectural and optimization biases rather than by a single regime switch.
Abstract
We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that lead the network to prefer one solution over another. We first show that architectural choices, such as the position of Layer Normalization (LN), strongly modulate grokking speed. We explain this modulation by isolating how LN on specific pathways shapes shortcut learning and attention entropy. We then study how different optimization settings modulate grokking, which leads to distinct interpretations of previously proposed controls such as readout scale. In particular, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting that grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.
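
To make the notion of attention entropy mentioned above concrete, here is a minimal sketch. It assumes the common definition, the Shannon entropy of each softmax attention row averaged over query positions; the abstract does not state the exact definition used in the paper, and the names `softmax`, `row_entropy`, and `attention_entropy` are hypothetical helpers written for this illustration, not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def attention_entropy(score_rows):
    """Mean entropy over query positions (assumed definition, see lead-in)."""
    rows = [softmax(r) for r in score_rows]
    return sum(row_entropy(r) for r in rows) / len(rows)

# Toy pre-softmax scores: 2 query positions attending over 3 key positions.
diffuse = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # uniform attention
peaked = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]   # attention focused on one key

print(attention_entropy(diffuse))  # equals log(3), the maximum for 3 keys
print(attention_entropy(peaked))   # much smaller: attention is concentrated
```

Under this definition, uniform attention over n keys gives the maximum value log(n), and attention concentrated on a single key gives a value near zero, so the quantity summarizes how diffuse or focused a head is.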
