Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia

Zhejian Zhou; Jiayu Wang; Dahua Lin; Kai Chen

TL;DR

It is empirically shown that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training, while the different numeral systems reach very similar fine-tuning performance.

Abstract

Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and the choice affects performance on numeric operations. Two schemes are currently representative: 1) tokenizing into single digits, and 2) tokenizing into chunks of $1\sim 3$ digits. The difference is roughly equivalent to using different numeral systems (namely base $10$ or base $10^{3}$). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training, while the different numeral systems reach very similar fine-tuning performance. We attribute this to the higher token frequencies of a base $10$ system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication. We identify that base $10^{2}$ and base $10^{3}$ systems struggle with token-level discernment and token-level operations. We also shed light on the mechanism learned by the models.
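
The correspondence between digit-chunk tokenization and numeral systems can be illustrated with a minimal sketch. The function name `to_base_tokens` and the choice to group digits starting from the least-significant end are assumptions made for illustration only; they are not taken from the paper, and real tokenizers may chunk digits differently.

```python
def to_base_tokens(n: int, digits_per_token: int) -> list[str]:
    """Split a non-negative integer into decimal-digit chunks.

    Grouping k digits per token from the least-significant end is
    equivalent to writing n in base 10**k (an illustrative assumption,
    not the paper's exact tokenizer).
    """
    if n < 0 or digits_per_token < 1:
        raise ValueError("expected n >= 0 and digits_per_token >= 1")
    s = str(n)
    tokens = []
    while s:
        tokens.append(s[-digits_per_token:])  # take the last k digits
        s = s[:-digits_per_token]             # drop them and continue
    return tokens[::-1]                       # most-significant token first


def full_width_vocab_size(digits_per_token: int) -> int:
    """Number of distinct full-width tokens: 10**k for k digits per token."""
    return 10 ** digits_per_token


print(to_base_tokens(1234567, 1))  # ['1', '2', '3', '4', '5', '6', '7']
print(to_base_tokens(1234567, 2))  # ['1', '23', '45', '67']
print(to_base_tokens(1234567, 3))  # ['1', '234', '567']
print([full_width_vocab_size(k) for k in (1, 2, 3)])  # [10, 100, 1000]
```

Under this sketch, a base $10$ system spreads the same training text over only ten digit tokens, so each token is seen far more often than any single token in a base $10^{3}$ system, which is consistent with the token-frequency explanation given in the abstract.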

Paper Structure (24 sections, 10 figures, 4 tables)

This paper contains 24 sections, 10 figures, 4 tables.

Introduction
Related Work
Scaling Behavior Experiment Designs
Synthetic Data Generation
Evaluation Setup
Relative Error
Normalized Edit Similarity
Experiments and Results
Overall Trends
In-domain Interpolation Evaluation
Addition
Multiplication
Out-of-domain Extrapolation Evaluation
Addition
Multiplication
...and 9 more sections

Figures (10)

Figure 1: Answer Token Distribution for Multiplication. We sample $2^{13}$ addition samples to illustrate the distribution. Token values are normalized to $[0,1]$.
Figure 2: Relative Error (lower is better) and Normalized Edit Similarity (higher is better) for addition operation with different data scales, model parameter sizes, from-scratch or fine-tune, and numeral systems.
Figure 3: Exact match accuracy for addition operation with different data scales, model parameter sizes, from-scratch or fine-tune, and numeral systems.
Figure 4: Relative Error and Normalized Edit Similarity for multiplication operation with different data scales, model parameter sizes, from-scratch or fine-tune, and numeral systems.
Figure 5: Relative Error Matrix for Extrapolation Behavior Analysis. The results are obtained using a 1.4B model fine-tuned on $2^{19}$ training samples.
...and 5 more figures
