Pre-trained Large Language Models Use Fourier Features to Compute Addition

Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia

TL;DR

The paper reveals that pre-trained large language models compute addition not by memorization but through Fourier features embedded in hidden representations. It shows a division of labor where MLPs primarily approximate the magnitude using low-frequency components, while attention modules perform modular addition using high-frequency components, with both components ultimately summing to yield the correct result. Pre-training is shown to be crucial, as models trained from scratch lack these Fourier features unless pre-trained embeddings are provided; this inductive bias also transfers to in-context learning. The findings offer a mechanistic, frequency-domain view of arithmetic in Transformer models and suggest how pre-training shapes capabilities for algorithmic tasks, with implications for prompting and model design.

Abstract

Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
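
The division of labor described above can be illustrated with a toy sketch. This is not the paper's model or code: the candidate range of 520 numbers, the single low-frequency period, and the helper names `cosine_wave` and `toy_logits` are assumptions made for illustration. The high-frequency periods 2, 2.5, 5, and 10 and the answer 108 follow the figure captions. The sketch shows that a low-frequency wave alone only narrows the answer to a neighborhood, high-frequency waves alone only fix the answer modulo 10, and their sum singles out one number.

```python
import numpy as np

def cosine_wave(candidates, answer, period):
    """Cosine over candidate numbers; equals 1 wherever (n - answer) is a multiple of period."""
    return np.cos(2 * np.pi * (candidates - answer) / period)

def toy_logits(answer, n_candidates=520):
    """Return candidates plus a low-frequency and a high-frequency toy logit vector.

    n_candidates=520 is an assumed size of the number space, not a value from the paper.
    """
    candidates = np.arange(n_candidates)
    # One slow wave spanning the whole range: a stand-in for "approximate magnitude".
    low = cosine_wave(candidates, answer, period=float(n_candidates))
    # Fast waves with the periods named in the captions: a stand-in for modular addition.
    high = sum(cosine_wave(candidates, answer, T) for T in (2, 2.5, 5, 10))
    return candidates, low, high

candidates, low, high = toy_logits(answer=108)

# Low frequency alone: several neighbors of the answer score almost as high as the answer.
near_top_low = candidates[low > low.max() - 1e-3]

# High frequency alone: every number congruent to the answer mod 10 ties for the maximum.
tied_high = candidates[np.isclose(high, high.max())]

# The sum of both has a single clear maximum.
prediction = int(candidates[np.argmax(low + high)])
print(near_top_low, tied_high[:5], prediction)
```

In this toy setting the two parts are complementary in the sense the abstract describes: neither vector identifies the answer reliably on its own, while their sum does.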

Paper Structure

This paper contains 49 sections, 10 equations, 22 figures, and 1 table.

Figures (22)

  • Figure 1: (a) Visualization of predictions extracted from fine-tuned GPT-2-XL at intermediate layers. Between layers 20 and 30, the model's accuracy is low, but its prediction is often within 10 of the correct answer: the model first approximates the answer, then refines it. (b) Heatmap of the logits from different MLP layers for the running example, "Put together 15 and 93. Answer: 108". The $y$-axis represents the subset of the number space around the correct prediction, while the $x$-axis represents the layer index. The $33$rd layer performs $\mathrm{mod}\ 2$ operations (favoring even numbers), while other layers perform other modular addition operations, such as $\mathrm{mod}\ 10$ ($45$th layer). Additionally, most layers allocate more weight to numbers closer to the correct answer, $108$. (c) Analogous plot for attention layers. Nearly all attention modules perform modular addition.
  • Figure 2: The intermediate logits in Fourier space. We annotate the top-$10$ outlier high-frequency Fourier components based on their magnitudes. $T$ stands for the period of that Fourier component. (a) The logits in Fourier space for the MLP output of the $33$rd layer, i.e., $\widehat{\mathcal{L}}_{\mathrm{MLP}}^{(33)}$. The component with period $2$ has the largest magnitude, aligning with the observations in Figures \ref{fig:error_accuracy_skip_logit_lens}b and \ref{fig:mlp_logit_wave}a. (b) The logits in Fourier space for the attention output of the $40$th layer, i.e., $\widehat{\mathcal{L}}_{\mathrm{Attn}}^{(40)}$. The components with periods $5$ and $10$ have the largest magnitude, aligning with the observations in Figures \ref{fig:error_accuracy_skip_logit_lens}c and \ref{fig:mlp_logit_wave}b.
  • Figure 3: Analysis of logits in Fourier space for all the test data across the last $15$ layers. For both the MLP and attention modules, outlier Fourier components have periods around $2$, $2.5$, $5$, and $10$.
  • Figure 4: Visualization of how a sparse subset of Fourier components can identify the correct answer. (a) Shows the top-$5$ Fourier components for the final logits. (b) Shows the sum of these top-$5$ Fourier components, highlighting how the cumulative effect identifies the correct answer, $108$.
  • Figure 5: (a) Number embedding in Fourier space for fine-tuned GPT-2-XL. $T$ stands for the period of that Fourier component. (b) Visualization of token embedding clustering of GPT-2 using t-SNE and $k$-means with $10$ clusters. The numbers are clustered based on their magnitude and whether they are multiples of $10$.
  • ...and 17 more figures
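
The idea behind Figure 4, that a handful of Fourier components can be enough to pick out the answer, can be sketched on synthetic data. Everything here is an assumption for illustration rather than the paper's procedure: the logits are synthetic, NumPy's real FFT stands in for the paper's Fourier basis, the number space is taken to be 520 so that periods 2, 2.5, 5, and 10 fall exactly on DFT bins, and `synthetic_logits` and `keep_top_k` are hypothetical helpers.

```python
import numpy as np

N = 520        # assumed size of the number space (so periods 2, 2.5, 5, 10 align with DFT bins)
ANSWER = 108   # running example from the figure captions

def synthetic_logits(rng):
    """Five cosines that all peak at ANSWER, plus small Gaussian noise."""
    n = np.arange(N)
    periods = (N, 10, 5, 2.5, 2)
    clean = sum(np.cos(2 * np.pi * (n - ANSWER) / T) for T in periods)
    return clean + rng.normal(scale=0.05, size=N)

def keep_top_k(logits, k):
    """Zero out all but the k largest-magnitude DFT bins and transform back."""
    spectrum = np.fft.rfft(logits)
    top = np.argsort(np.abs(spectrum))[-k:]
    sparse = np.zeros_like(spectrum)
    sparse[top] = spectrum[top]
    return np.fft.irfft(sparse, n=len(logits)), np.sort(top)

logits = synthetic_logits(np.random.default_rng(0))
approx, kept_bins = keep_top_k(logits, k=5)
kept_periods = [N / b for b in kept_bins]   # period T = N / bin index
print(kept_periods, int(np.argmax(approx)))
```

On this synthetic input the five kept bins correspond to periods 520, 10, 5, 2.5, and 2, and the argmax of their sum is 108. Real intermediate logits are noisier, so the number of components needed there is an empirical question the sketch does not answer.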

Theorems & Definitions (9)

  • Definition A.1: Transformer
  • Definition A.2: Intermediate Logits
  • Definition A.3: Fourier Basis
  • Remark A.4: Discrete Fourier transform (DFT) and inverse DFT
  • Definition A.5: Logits in Fourier Space
  • Definition A.6: Low-pass / High-pass Filter
  • Remark A.7
  • Definition B.1: Single-Pass Filter
  • Remark B.2
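
Only the names of these definitions are listed here, so the following is a generic sketch of low-pass and high-pass filtering of a logit vector, not a reproduction of Definition A.6. It assumes NumPy's real FFT as the transform, a hard cutoff at a hypothetical bin index `cutoff_bin`, and a number space of 520. The helper `split_by_frequency` is illustrative.

```python
import numpy as np

def split_by_frequency(logits, cutoff_bin):
    """Split logits into a low-pass part (bins below cutoff_bin) and a high-pass part (the rest)."""
    spectrum = np.fft.rfft(logits)
    low_spec = spectrum.copy()
    low_spec[cutoff_bin:] = 0            # keep only the slow components
    high_spec = spectrum - low_spec      # everything that was removed above
    n = len(logits)
    return np.fft.irfft(low_spec, n=n), np.fft.irfft(high_spec, n=n)

# Synthetic logits: one slow wave (period 520) and one fast wave (period 10), both peaking at 108.
n = np.arange(520)
slow = np.cos(2 * np.pi * (n - 108) / 520)
fast = np.cos(2 * np.pi * (n - 108) / 10)
logits = slow + fast

low, high = split_by_frequency(logits, cutoff_bin=20)
print(int(np.argmax(low)), np.flatnonzero(np.isclose(high, high.max()))[:5])
```

Because the two filters partition the spectrum, the low-pass and high-pass outputs add back up to the original logits. That property is what makes it possible to ablate one frequency band and attribute the change in behavior to it.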