Pre-trained Large Language Models Use Fourier Features to Compute Addition

Tianyi Zhou; Deqing Fu; Vatsal Sharan; Robin Jia

TL;DR

The paper shows that pre-trained large language models compute addition not by memorization but through Fourier features in their hidden representations. MLP layers primarily approximate the magnitude of the answer using low-frequency components, while attention layers primarily perform modular addition using high-frequency components; the two contributions are summed to yield the correct result. Pre-training is crucial: models trained from scratch exploit only low-frequency features and reach lower accuracy unless pre-trained token embeddings are provided. This inductive bias also transfers to in-context learning. The findings offer a mechanistic, frequency-domain view of arithmetic in Transformer models and suggest how pre-training shapes capabilities for algorithmic tasks, with implications for prompting and model design.

Abstract

Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
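The complementary roles of low- and high-frequency features can be illustrated with a toy superposition. The following is a minimal NumPy sketch, not the paper's code: the number range `N = 520`, the helper `cosine_logits`, and the choice of one cosine per period are illustrative assumptions. The periods 2, 2.5, 5, and 10 echo the outlier components reported in the figure captions.

```python
import numpy as np

N = 520        # hypothetical size of the number vocabulary (assumption)
ANSWER = 108   # running example: 15 + 93

def cosine_logits(answer, periods, n=N):
    """Sum of cosine waves over 0..n-1, each peaking at `answer`."""
    numbers = np.arange(n)
    return sum(np.cos(2 * np.pi * (numbers - answer) / T) for T in periods)

# High-frequency part: only pins down the answer modulo 10.
high = cosine_logits(ANSWER, periods=[2, 2.5, 5, 10])
# Low-frequency part: a broad bump around the answer's magnitude.
low = cosine_logits(ANSWER, periods=[N])

combined = low + high
tied = np.flatnonzero(np.isclose(high, high.max()))

print(tied[:5])                        # numbers the high-frequency part cannot tell apart
print(low[ANSWER] - low[ANSWER + 1])   # margin from the low-frequency part alone
print(int(np.argmax(combined)))        # number singled out by the superposition
```

In this toy setting the high-frequency cosines tie across every number that shares the answer's last digit, and the low-frequency cosine separates neighboring numbers only by a very small margin. Their sum resolves both ambiguities at once.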

Paper Structure (49 sections, 10 equations, 22 figures, 1 table)

Introduction
Problem Setup
Task and Dataset.
Model.
Transformers.
Language Models Solve Addition with Fourier Features
Behavioral Analysis
Extracting intermediate predictions.
LLMs progressively compute the final answers.
Fourier Features in MLP & Attention Outputs
Logits for MLP and attention have periodic structures.
Logits for MLP and attention are approximately sparse in the Fourier space.
Final logits are superpositions of these outlier Fourier components.
Why is high-frequency classification helpful?
Fourier Features are Causally Important for Model Predictions
...and 34 more sections

Figures (22)

Figure 1: (a) Visualization of predictions extracted from fine-tuned GPT-2-XL at intermediate layers. Between layers 20 and 30, the model's accuracy is low, but its prediction is often within 10 of the correct answer: the model first approximates the answer, then refines it. (b) Heatmap of the logits from different MLP layers for the running example, "Put together 15 and 93. Answer: 108". The $y$-axis represents the subset of the number space around the correct prediction, while the $x$-axis represents the layer index. The $33$rd layer performs $\mathrm{mod}\ 2$ operations (favoring even numbers), while other layers perform other modular addition operations, such as $\mathrm{mod}\ 10$ ($45$th layer). Additionally, most layers allocate more weight to numbers closer to the correct answer, $108$. (c) Analogous plot for attention layers. Nearly all attention modules perform modular addition.
Figure 2: The intermediate logits in Fourier space. We annotate the top-$10$ outlier high-frequency Fourier components based on their magnitudes. $T$ stands for the period of that Fourier component. (a) The logits in Fourier space for the MLP output of the $33$rd layer, i.e., $\widehat{\mathcal{L}}_{\mathrm{MLP}}^{(33)}$. The component with period $2$ has the largest magnitude, aligning with the observations in Figures \ref{fig:error_accuracy_skip_logit_lens}b and \ref{fig:mlp_logit_wave}a. (b) The logits in Fourier space for the attention output of the $40$th layer, i.e., $\widehat{\mathcal{L}}_{\mathrm{Attn}}^{(40)}$. The components with periods $5$ and $10$ have the largest magnitude, aligning with the observations in Figures \ref{fig:error_accuracy_skip_logit_lens}c and \ref{fig:mlp_logit_wave}b.
Figure 3: Analysis of logits in Fourier space for all the test data across the last $15$ layers. For both the MLP and attention modules, outlier Fourier components have periods around $2$, $2.5$, $5$, and $10$.
Figure 4: Visualization of how a sparse subset of Fourier components can identify the correct answer. (a) Shows the top-$5$ Fourier components for the final logits. (b) Shows the sum of these top-$5$ Fourier components, highlighting how the cumulative effect identifies the correct answer, $108$.
Figure 5: (a) Number embedding in Fourier space for fine-tuned GPT-2-XL. $T$ stands for the period of that Fourier component. (b) Visualization of token embedding clustering of GPT-2 using t-SNE and $k$-means with $10$ clusters. The numbers are clustered based on their magnitude and whether they are multiples of $10$.
...and 17 more figures
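
The idea in the Figure 4 caption, that a handful of Fourier components is enough to identify the answer, can be tried on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's procedure: the logits are made up (five cosines plus Gaussian noise over a hypothetical range of 520 numbers), `top_k_reconstruction` is a hypothetical helper, and NumPy's real FFT stands in for whatever Fourier basis the paper defines.

```python
import numpy as np

N, ANSWER = 520, 108   # hypothetical number range; answer of the running example
numbers = np.arange(N)

# Synthetic "final logits": five cosines peaking at ANSWER, plus noise (all assumed).
rng = np.random.default_rng(0)
clean = sum(np.cos(2 * np.pi * (numbers - ANSWER) / T) for T in [2, 2.5, 5, 10, N])
logits = clean + 0.05 * rng.standard_normal(N)

def top_k_reconstruction(signal, k):
    """Keep the k largest-magnitude real-FFT coefficients and invert."""
    spectrum = np.fft.rfft(signal)
    keep = np.argsort(np.abs(spectrum))[-k:]
    sparse = np.zeros_like(spectrum)
    sparse[keep] = spectrum[keep]
    return np.fft.irfft(sparse, n=len(signal)), keep

approx, kept_bins = top_k_reconstruction(logits, k=5)
periods = len(logits) / kept_bins   # period T = N / frequency index

print(sorted(periods.tolist()))     # periods of the retained components
print(int(np.argmax(approx)))       # prediction from the sparse reconstruction
```

Only 5 of the 261 real-FFT coefficients are kept, so most of the noise is discarded and the sparse reconstruction keeps the peak of the underlying cosines.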

Theorems & Definitions (9)

Definition A.1: Transformer
Definition A.2: Intermediate Logits
Definition A.3: Fourier Basis
Remark A.4: Discrete Fourier transform (DFT) and inverse DFT
Definition A.5: Logits in Fourier Space
Definition A.6: Low-pass / High-pass Filter
Remark A.7
Definition B.1: Single-Pass Filter
Remark B.2
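
The definitions listed above are not reproduced on this page, so the following is only a rough sketch of what a low-pass / high-pass split of logits might look like. It assumes NumPy's real FFT as the Fourier basis and a hypothetical frequency-index threshold `cutoff`; the paper's actual basis and threshold may differ.

```python
import numpy as np

def split_low_high(logits, cutoff):
    """Split logits into low- and high-frequency parts at frequency index `cutoff`."""
    spectrum = np.fft.rfft(logits)
    low_spec = spectrum.copy()
    low_spec[cutoff:] = 0              # low-pass: keep indices below the cutoff
    high_spec = spectrum - low_spec    # high-pass: keep the remaining indices
    n = len(logits)
    return np.fft.irfft(low_spec, n=n), np.fft.irfft(high_spec, n=n)

# Toy logits over a hypothetical range of 520 numbers, peaking at 108.
numbers = np.arange(520)
magnitude = np.cos(2 * np.pi * (numbers - 108) / 520)   # low-frequency component
last_digit = np.cos(2 * np.pi * (numbers - 108) / 10)   # high-frequency component

low, high = split_low_high(magnitude + last_digit, cutoff=10)
print(np.allclose(low, magnitude), np.allclose(high, last_digit))
```

Because the two filters partition the spectrum, the filtered parts always add back up to the original logits. Zeroing one part before decoding is one way to probe how much each frequency band contributes to a prediction.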
