FoNE: Precise Single-Token Number Embeddings via Fourier Features

Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan

TL;DR

FoNE introduces a principled, Fourier-based single-token embedding for numbers in LLMs, encoding each digit with two dimensions across multiple periods T_i = 10^i to yield exact recoverability of residues x mod 10^i. A dedicated decoding head maps last-layer representations from Fourier space back to digits, enabling accurate arithmetic while bypassing token fragmentation. Empirical results demonstrate superior data and parameter efficiency, including perfect accuracy on key arithmetic tasks with far less data and smaller models than baselines, and faster training/inference due to one-token-per-number input. The work also shows FoNE’s compatibility with existing embeddings (e.g., Abacus) and discusses extensions to longer numbers and pretraining integration, highlighting significant practical impact for numerical reasoning in LLMs.
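
As a worked illustration of the encoding summarized above, the block below writes out the embedding for an integer $x$ with $m$ digits. It is a sketch under stated assumptions rather than the paper's exact definition: the symbol $m$ and the restriction to integers are ours, and the numeric values are rounded to three decimals.

```latex
\mathrm{FoNE}(x) = \Bigl[\cos\tfrac{2\pi x}{10},\ \sin\tfrac{2\pi x}{10},\ 
                         \cos\tfrac{2\pi x}{10^{2}},\ \sin\tfrac{2\pi x}{10^{2}},\ 
                         \dots,\ 
                         \cos\tfrac{2\pi x}{10^{m}},\ \sin\tfrac{2\pi x}{10^{m}}\Bigr]

% Worked example with x = 18 and m = 2 (periods T_1 = 10, T_2 = 100):
\mathrm{FoNE}(18) \approx \bigl[\,0.309,\ -0.951,\ 0.426,\ 0.905\,\bigr]
```

The first pair depends only on $18 \bmod 10 = 8$, and the second pair only on $18 \bmod 100 = 18$, which is why each pair pins down one residue.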

Abstract

Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction, and multiplication. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3$\times$ and 6$\times$ fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The code and visualizations are available at https://fouriernumber.github.io/.

Paper Structure

This paper contains 44 sections, 5 theorems, 22 equations, 13 figures, 8 tables, 1 algorithm.

Key Result

Lemma 3.3

Given the pair $\left(\cos\left(\tfrac{2\pi}{T}x\right), \sin\left(\tfrac{2\pi}{T}x\right)\right)$, we can recover $x \bmod T$.
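
One way to see why the lemma holds is that the pair is a point on the unit circle, so its angle determines $x \bmod T$. The sketch below inverts the pair with `atan2`; the function name `recover_residue` is hypothetical and this is an illustration, not the paper's proof or implementation.

```python
import math


def recover_residue(c: float, s: float, T: float) -> float:
    """Recover x mod T from the pair (cos(2*pi*x/T), sin(2*pi*x/T))."""
    angle = math.atan2(s, c)              # angle in (-pi, pi]
    angle %= 2 * math.pi                  # shift into [0, 2*pi)
    return angle * T / (2 * math.pi)      # rescale the angle to a residue


# Example: x = 18 with period T = 10 should give a residue close to 8.
x, T = 18, 10
c = math.cos(2 * math.pi * x / T)
s = math.sin(2 * math.pi * x / T)
residue = recover_residue(c, s, T)
```

The result is a float that is only approximately equal to the true residue. For integer inputs, a caller would typically round it and reduce modulo `T`, which also handles the edge case where floating-point noise near an angle of zero returns a value next to `T` instead of 0.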

Figures (13)

  • Figure 1: (a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent $18 \bmod 10$, while the next two entries represent $18 \bmod 100$. (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model. (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.
  • Figure 2: We train Llama-3.2-1B from scratch with random initialization using different number embedding methods on 6-digit decimal addition. The test accuracy is compared across varying data sizes and model sizes.
  • Figure 3: Comparison of accuracy trends for various arithmetic tasks with respect to model size and data size.
  • Figure 4: (a) Average accuracy of an 8-layer transformer model on 60-digit addition tasks using FoNE for chunked input. (b) Performance improvements achieved by combining FoNE with the Abacus embedding method across various random seeds. The transformer is trained on addition tasks with up to 10-digit numbers (represented by the smaller square) and tested on numbers of up to 50 digits.
  • Figure 5: We train Llama-3.2-1B from scratch with random initialization using different number embedding methods on number classification where $d=10$. The test accuracy is compared across varying data sizes and model sizes.
  • ...and 8 more figures
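
The pipeline described in the Figure 1 caption (embed, pad and add, decode by nearest match) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the function names `fone`, `pad_and_add`, and `decode_digits` are hypothetical, only non-negative integers are handled, and the decoder is applied directly to the embedding rather than to a trained model's last hidden state. The nearest-match search over ten candidates per position is our simplification and may differ from the paper's decoding head.

```python
import math


def fone(x: int, num_digits: int) -> list[float]:
    """Two entries (cos, sin) per digit position, with periods 10, 100, ..."""
    emb: list[float] = []
    for i in range(1, num_digits + 1):
        angle = 2 * math.pi * x / 10 ** i
        emb += [math.cos(angle), math.sin(angle)]
    return emb


def pad_and_add(number_emb: list[float], word_emb: list[float]) -> list[float]:
    """Zero-pad the number embedding to the model width, then add it to a word embedding."""
    padded = number_emb + [0.0] * (len(word_emb) - len(number_emb))
    return [w + p for w, p in zip(word_emb, padded)]


def decode_digits(vec: list[float], num_digits: int) -> list[int]:
    """Read digits, least significant first, by nearest match on each (cos, sin) pair."""
    digits: list[int] = []
    residue = 0  # value of the digits decoded so far, i.e. x mod 10**(i-1)
    for i in range(1, num_digits + 1):
        c, s = vec[2 * (i - 1)], vec[2 * (i - 1) + 1]
        period = 10 ** i

        def dist(d: int) -> float:
            # Distance between the pair and the point for candidate digit d.
            angle = 2 * math.pi * (residue + d * 10 ** (i - 1)) / period
            return math.hypot(c - math.cos(angle), s - math.sin(angle))

        best = min(range(10), key=dist)
        digits.append(best)
        residue += best * 10 ** (i - 1)
    return digits


emb = fone(18, num_digits=2)
combined = pad_and_add(emb, [0.0] * 8)     # all-zero word embedding of width 8 (assumed)
print(decode_digits(combined, 2))          # [8, 1], i.e. the digits of 18, units first
```

Because each candidate at position `i` differs from its neighbours by a tenth of a full turn, the correct digit is well separated from the others, so small perturbations of the pair do not change the decoded digit.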

Theorems & Definitions (12)

  • Definition 3.1: Circular embedding
  • Definition 3.2: Fourier Number Embedding
  • Lemma 3.3: Informal version of Lemma C.1
  • Lemma 3.4: FoNE preserves numeracy
  • Lemma 3.5: Necessity of different periods
  • Example 3.6
  • Definition 3.7: Fourier Number Loss Function
  • Definition 3.8: Fourier Number Prediction for the $i$-th digit
  • Lemma C.1: Formal version of Lemma 3.3
  • proof
  • ...and 2 more