Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities

Roussel Rahman

TL;DR

This study introduces Numberland, a 100-problem test designed to probe LLMs' elementary numerical reasoning and number sense by separating deterministic arithmetic tasks from a trial-and-error game (the 24 game). Across five agents, deterministic tasks are solved with high accuracy (e.g., o1 at about 90% overall), but performance collapses on the 24 game, pointing to fragile numerical intuition and limited search capabilities in current LLMs. The authors argue that this gap reflects constraints akin to bounded rationality and propose a multi-level evaluation framework, reinforced by mechanistic interpretability, to understand and improve AI reasoning. Follow-up results, including those from newer models, indicate that trial-and-error search remains a persistent challenge. The work advocates targeted, explainable testing to bridge the gap between impressive benchmark scores and reliable, safe numerical problem solving in real-world use.

Abstract

An essential element of human mathematical reasoning is our number sense -- an abstract understanding of numbers and their relationships -- which allows us to solve problems involving vast number spaces using limited computational resources. Mathematical reasoning of Large Language Models (LLMs) is often tested on high-level problems (such as Olympiad challenges, geometry, word problems, and puzzles), but their low-level number sense remains less explored. We introduce "Numberland," a 100-problem test to evaluate the numerical reasoning abilities of LLM-based agents. The tasks -- basic operations, advanced calculations (e.g., exponentiation, complex numbers), prime number checks, and the 24 game -- aim to test elementary skills and their integration in solving complex and uncertain problems. We evaluated five LLM-based agents: OpenAI's o1 and o1-mini, Google Gemini, Microsoft Copilot, and Anthropic Claude. They scored 74-95% on the first three tasks, which allow deterministic steps to solutions. In the 24 game, which requires trial-and-error search, performance dropped to 10-73%. We tested the top 24-game solver (o1, with 73% accuracy) on 25 harder problems, and its score fell to 27%, confirming search as a bottleneck. These results, along with the types of mistakes, suggest a fragile number sense in LLMs, which is somewhat surprising given their prowess on challenging benchmarks. The limits of LLM numerical reasoning highlight the scope for simple, targeted tests to evaluate and explain LLM math skills and so ensure safe use.
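To make concrete why the 24 game calls for trial-and-error search rather than a fixed procedure, here is a minimal brute-force sketch. It assumes the standard rules (use each of four given numbers exactly once, combined with +, -, *, / and parentheses, to reach 24); the exact variant used in Numberland may differ. The names `solve24` and `_search` are hypothetical and are not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations


def solve24(numbers, target=24):
    """Return one expression string that evaluates to target, or None."""
    # Exact rational arithmetic avoids floating-point misses such as 8 / (3 - 8/3).
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value remains; check whether it hits the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them with every operation, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(solve24([4, 7, 8, 8]))  # prints some valid expression
    print(solve24([1, 1, 1, 1]))  # None: no combination reaches 24
```

Even this exhaustive search has no shortcut: it must try pairings and operations until one works, which is the kind of exploration the abstract identifies as a bottleneck for the tested agents.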
