Number Cookbook: Number Understanding of Language Models and How to Improve It

Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang

TL;DR

The paper defines the NUPA benchmark to diagnose numerical understanding in LLMs. It spans four number representations (integer, float, fraction, and scientific notation) and 17 tasks across four ability categories, for 41 task-representation combinations in total. The benchmark reveals that state-of-the-art models solve easy numerical problems but falter on longer inputs and non-integer formats, highlighting weaknesses in digit-level processing and length generalization. The authors explore three improvement directions: pretraining choices (tokenizers, positional encodings (PEs), number formats), finetuning on NUPA data, and chain-of-thought approaches (RF-CoT). Naive finetuning helps on some tasks, but the NUPA-specific tricks often degrade performance when applied during finetuning, and RF-CoT carries costs in speed and context-window length that limit its practicality. In the small-model experiments, one-digit tokenizers and regularized PEs generally improve length generalization, while number data formats aid digit alignment. Comprehensive, scalable improvements nonetheless remain elusive, which motivates continued research and the public release of the benchmark and code.

Abstract

Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.
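To make the 9.11 versus 9.9 example concrete, the following minimal Python sketch compares numbers given as strings in the four representations the benchmark covers. The function names `compare` and `naive_compare` are hypothetical and are not taken from the released code; `naive_compare` is only one illustrative way to arrive at the wrong answer and makes no claim about how an LLM actually computes it.

```python
from fractions import Fraction


def compare(a: str, b: str) -> str:
    """Exactly compare two numbers written as strings.

    Fraction parses integers ("12"), decimals ("9.11"),
    fractions ("3/4") and scientific notation ("1.5e3").
    """
    x, y = Fraction(a), Fraction(b)
    if x < y:
        return "<"
    if x > y:
        return ">"
    return "="


def naive_compare(a: str, b: str) -> str:
    """Hypothetical faulty rule: treat the digits after the
    decimal point as a separate integer (so 11 beats 9)."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    left = (int(a_int), int(a_frac))
    right = (int(b_int), int(b_frac))
    if left < right:
        return "<"
    if left > right:
        return ">"
    return "="


print(compare("9.11", "9.9"))        # <
print(naive_compare("9.11", "9.9"))  # >  (the mistake from the abstract)
print(compare("1.5e3", "1500"))      # =
print(compare("3/4", "0.7"))         # >
```

Using exact rational arithmetic avoids floating-point rounding, so the reference answer stays correct regardless of how many digits the inputs have.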

Paper Structure

This paper contains 43 sections, 20 figures, 17 tables.

Figures (20)

  • Figure 1: An example of metrics.
  • Figure 2: Partial performance results of state-of-the-art LLMs on the NUPA benchmark. "-ft" denotes a Llama model we finetuned on these tasks (see the finetuning section).
  • Figure 3: Different tokenizations of a long number. (a) GPT-2: mixed-digit tokenizer. (b) Llama-2: one-digit tokenizer. (c) GPT-3.5, GPT-4 and Llama-3: three-digit tokenizer. (d) Aligned three-digit tokenizer.
  • Figure 4: Accuracy of 0.9B models trained with one- to three-digit tokenizers on three tasks: integer addition, float addition and integer multiplication. The shaded region shows the standard error. D$n$ means $n$ digits. The x-axis is the number of training samples seen.
  • Figure 5: Exact match of models tested on NUPA Test.
  • ...and 15 more figures
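
The tokenization schemes named in the Figure 3 caption can be illustrated with a short Python sketch. It rests on assumptions that the caption alone does not confirm: "three-digit" is taken to mean greedy left-to-right chunks of three digits, and "aligned" is taken to mean chunks grouped from the least-significant digit, so that chunk boundaries fall on thousands. The function names are hypothetical, real tokenizers operate on learned vocabularies rather than fixed slicing, and the mixed-digit scheme of GPT-2 is omitted because it depends on that vocabulary.

```python
from typing import List


def one_digit(number: str) -> List[str]:
    """One token per digit."""
    return list(number)


def three_digit(number: str) -> List[str]:
    """Assumed behaviour: greedy chunks of three digits, left to right."""
    return [number[i:i + 3] for i in range(0, len(number), 3)]


def three_digit_aligned(number: str) -> List[str]:
    """Assumed behaviour: chunks of three counted from the right,
    so any short chunk is the most-significant one."""
    head = len(number) % 3
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + 3] for i in range(head, len(number), 3)]
    return chunks


n = "1234567"
print(one_digit(n))            # ['1', '2', '3', '4', '5', '6', '7']
print(three_digit(n))          # ['123', '456', '7']
print(three_digit_aligned(n))  # ['1', '234', '567']
```

Under these assumptions, the left-to-right scheme gives the same trailing digit a different token position depending on the number's length, whereas the aligned scheme keeps units, thousands and millions in fixed chunks.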