Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials

Hiroki Furuta, Gouki Minegishi, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

The paper extends the mechanistic interpretability of grokking from modular addition to subtraction, multiplication, and polynomials, using small Transformer models and novel Fourier-based metrics. It introduces pre-grokked models and defines Fourier Frequency Density (FFD) and Fourier Coefficient Ratio (FCR) to quantify how internal representations evolve during grokking and to characterize operation-specific Fourier structures. The key findings are that grokking produces distinctive Fourier patterns for different modular operations: subtraction shows asymmetry, multiplication uses a broad range of frequencies, and polynomials reveal a superposition of representations and the importance of factorization. Transferability of learned representations is limited and highly operation-specific, while carefully designed multi-task mixtures can induce co-grokking and accelerate generalization for certain polynomial mixtures. These results provide empirical steps toward interpreting internal circuits in grokked Transformers and highlight both the potential and the limits of extending interpretability analyses beyond modular addition to more complex arithmetic tasks.

Abstract

Grokking has been actively explored to unravel the mystery of delayed generalization, and identifying interpretable representations and algorithms inside grokked models offers a suggestive hint toward understanding its mechanism. Grokking on modular addition is known to implement a Fourier representation and calculation circuits based on trigonometric identities in Transformers. Considering the periodicity of modular arithmetic, the natural question is to what extent these explanations and interpretations hold for grokking on other modular operations beyond addition. For a closer look, we first hypothesize that (1) any modular operation can be characterized by a distinctive Fourier representation or internal circuit, (2) grokked models obtain common features transferable among similar operations, and (3) mixing datasets with similar operations promotes grokking. Then, we extensively examine these hypotheses by training Transformers on complex modular arithmetic tasks, including polynomials. Our Fourier analysis and novel progress measures for modular arithmetic, Fourier Frequency Density and Fourier Coefficient Ratio, characterize the distinctive internal representations of grokked models per modular operation; for instance, polynomials often result in the superposition of the Fourier components seen in elementary arithmetic, but clear patterns do not emerge in challenging non-factorizable polynomials. In contrast, our ablation study on the pre-grokked models reveals that transferability among models grokked on each operation is limited to specific combinations, such as from elementary arithmetic to linear expressions. Moreover, some multi-task mixtures may lead to co-grokking -- where grokking happens simultaneously for all the tasks -- and accelerate generalization, while others may not find optimal solutions. We provide empirical steps towards the interpretability of internal circuits.
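To make the task setup concrete, the following is a minimal sketch of how a modular arithmetic dataset of the kind described above could be built. It is an illustration, not the authors' code: the function name `make_split`, the task table `TASKS`, the default modulus `p=97`, and the reading of `r` as the fraction of all input pairs used for training are assumptions made for this example.

```python
import random

# Hypothetical task table (illustrative): each entry maps (a, b) to an
# integer that is later reduced modulo p.
TASKS = {
    "a+b": lambda a, b: a + b,
    "a-b": lambda a, b: a - b,
    "a*b": lambda a, b: a * b,
    "a^2+b^2": lambda a, b: a * a + b * b,
    "ab+a+b": lambda a, b: a * b + a + b,
}


def make_split(task, p=97, r=0.3, seed=0):
    """Enumerate all p*p input pairs and split them into train and test sets.

    p    : modulus (assumed value, chosen only for illustration)
    r    : assumed to be the fraction of all pairs used for training
    seed : seed for the shuffle, so the split is reproducible
    """
    f = TASKS[task]
    # Python's % returns a value in [0, p) for positive p, even when
    # f(a, b) is negative (as in subtraction).
    examples = [(a, b, f(a, b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(r * len(examples))
    return examples[:n_train], examples[n_train:]
```

Calling `make_split("a-b", p=7, r=0.5)` enumerates all 49 pairs and returns disjoint train and test lists of `(a, b, label)` triples. A multi-task mixture could be formed by concatenating the training lists of several tasks, with an extra token identifying the operation; that encoding is left out here because the text above does not specify it.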

Paper Structure (34 sections, 8 equations, 24 figures, 5 tables)


Figures (24)

  • Figure 1: Grokking has so far been investigated with training from scratch. To shed light on the dynamics inside the Transformer, we introduce the notion of pre-grokked models, which are pre-trained on a similar task until grokking and used to replace randomly initialized modules without any parameter updates (i.e., frozen). We use the pre-grokked embedding and Transformer in later sections.
  • Figure 2: Test accuracy in modular elementary arithmetic (addition, subtraction, and multiplication) with pre-grokked models (embedding and Transformer). The x-axis is on a logarithmic scale. Because of the simplicity of the tasks, grokking always occurs in elementary arithmetic when training from scratch. However, in certain combinations, pre-grokked models hinder grokking even with a fraction of $r=0.9$. For the pre-grokked embedding, addition and subtraction accelerate each other's grokking (fig[0:2, 0:2]), while multiplication shows no synergy with either of them ($+$: fig[2, 0] and [0, 2], $-$: fig[2, 1] and [1, 2]). In contrast, for the pre-grokked Transformer, subtraction is challenging in both directions, even when transferring a subtraction model to subtraction itself (fig[1, 4]). With small $r$, addition and multiplication accelerate each other (fig[0, 5] and [2, 3]).
  • Figure 3: Frequency analysis of grokking with elementary arithmetic. Subtraction learns an embedding similar to that of addition, with sparse Fourier components (fig[0, 0] and fig[1, 0]). However, it imposes an asymmetric neuron-logit map and norm of logits with cosine biases (fig[1, 1] and fig[1, 2]). Multiplication obtains an embedding quite different from the others (fig[2, :]); it employs all the frequencies equally, with a cosine bias, for both the embedding and the neuron-logit map.
  • Figure 4: Test accuracy in modular polynomials (univariate terms: $a^2+b^2$, $a^2\pm b$, $a^3 \pm 2b$; degree-1 terms with a cross term: $ab+a+b$). Grokking occurs even in quadratic or cubic expressions that are asymmetric in the inputs $a$ and $b$.
  • Figure 5: Frequency analysis of grokking with modular polynomials ($a^2+b^2$, $a^2-b$, $ab+a+b$). Grokking discovers a superposition of the frequency sparsity and bias seen in elementary arithmetic; the embedding for $a^2-b$ inherits both the biased sparsity seen in subtraction and the significant cosine biases seen in multiplication (fig[1,0]). Its neuron-logit map leverages addition-like sparsity (fig[1,1]).
  • ...and 19 more figures
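
The frequency analyses in the captions above describe embeddings in terms of sparse Fourier components and cosine or sine biases. The following is a minimal sketch of one way to measure such structure, assuming the embedding is given as `p` rows (one per input token `0..p-1`) and that `p` is odd. The function name `fourier_norms` and the normalization are choices made for this example; this is not the paper's definition of Fourier Frequency Density or Fourier Coefficient Ratio, which is not given in the text above.

```python
import math


def fourier_norms(embedding, p):
    """Per-frequency norms of an embedding over Z_p (illustrative sketch).

    embedding : list of p rows, each a list of d floats (row x = token x)
    p         : modulus, assumed odd so the scaled basis is orthonormal
    Returns {k: (cos_norm, sin_norm)} for k = 1 .. p // 2.
    """
    d = len(embedding[0])
    scale = math.sqrt(2.0 / p)  # normalizes cos/sin basis vectors for odd p
    norms = {}
    for k in range(1, p // 2 + 1):
        cos_sq = 0.0
        sin_sq = 0.0
        for j in range(d):
            # Project column j onto cos(2*pi*k*x/p) and sin(2*pi*k*x/p).
            c = sum(embedding[x][j] * math.cos(2 * math.pi * k * x / p)
                    for x in range(p))
            s = sum(embedding[x][j] * math.sin(2 * math.pi * k * x / p)
                    for x in range(p))
            cos_sq += c * c
            sin_sq += s * s
        # L2 norm across the d embedding dimensions for each component.
        norms[k] = (scale * math.sqrt(cos_sq), scale * math.sqrt(sin_sq))
    return norms
```

For a toy embedding with `p = 11` whose two columns are `cos(2*pi*3*x/p)` and `sin(2*pi*5*x/p)`, only `k = 3` has a large cosine norm and only `k = 5` has a large sine norm, which is the kind of sparsity the addition and subtraction captions describe. An embedding that spreads its norm evenly over all `k`, mostly in the cosine entries, would correspond to the pattern described for multiplication.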