StreetMath: Study of LLMs' Approximation Behaviors

Chiung-Yi Tseng, Somshubhra Roy, Maisha Thasin, Danyang Zhang, Blessing Effiong

TL;DR

The paper investigates whether LLMs can engage in context-appropriate street-math approximation or are biased toward exact arithmetic. It introduces StreetMath, a 1000-problem benchmark spanning common everyday calculations, and analyzes multiple model families using linear probes, causal pruning, and layer-wise diagnostics to distinguish exact versus approximate reasoning. Findings show a pervasive bias toward exact computation across architectures, with approximate answers often produced only after computing the exact value and at higher token costs, suggesting distinct neural substrates for the two modes. The work highlights a gap between human-like cognitive miserliness and current LLM capabilities, motivating future work on architectural and training strategies to enable flexible, efficiency-driven approximation.

Abstract

There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models' approximation abilities in real-world scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at https://github.com/ctseng777/StreetMath
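
As a rough illustration of the distinction the abstract draws, the sketch below contrasts a "street math" estimate (round each price, then add) with the exact total. It is not taken from the StreetMath benchmark or its scoring code: the function names, the sample prices, and the 5% tolerance are all assumptions chosen for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def street_estimate(prices, step=Decimal("1")):
    """Round each price to the nearest `step`, then sum the rounded values.

    Hypothetical helper: mimics a shopper's mental shortcut and is not
    part of the StreetMath codebase.
    """
    total = Decimal("0")
    for price in prices:
        value = Decimal(str(price))
        # Round to the nearest multiple of `step`, with halves rounding up.
        rounded = (value / step).quantize(Decimal("1"), rounding=ROUND_HALF_UP) * step
        total += rounded
    return total


def exact_total(prices):
    """Sum the prices exactly, using Decimal to avoid float drift."""
    return sum((Decimal(str(p)) for p in prices), Decimal("0"))


def relative_error(estimate, exact):
    """Absolute difference between estimate and exact, as a fraction of exact."""
    return abs(estimate - exact) / abs(exact)


# Illustrative basket; the values are made up.
prices = [3.99, 12.49, 7.95]
est = street_estimate(prices)      # 4 + 12 + 8
exact = exact_total(prices)        # 3.99 + 12.49 + 7.95
err = relative_error(est, exact)

# Assumed tolerance: treat an estimate within 5% of the exact total as "good enough".
good_enough = err < Decimal("0.05")
print(est, exact, good_enough)
```

In this framing, an answer that passes the tolerance check without first computing the exact total is the cheaper, human-like strategy that the paper reports models rarely adopt.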

Paper Structure

This paper contains 28 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Accuracy per layer across models for digits, paraphrase, and words tasks with near parameters 5 and 10.
  • Figure 2: Effect of structured pruning on task performance for all models. Accuracy is plotted against the proportion of parameters pruned for StreetMath and GSM8K benchmarks.
  • Figure 3: Comparative Layerwise Average Summary for Qwen3-4B-Instruct-2507 on GSM8K vs StreetMath
  • Figure 4: Comparative Layerwise Average Summary for Qwen3-4B-Thinking-2507 on GSM8K vs StreetMath
  • Figure 5: Comparative Layerwise Average Summary for Dream-v0-Instruct-7B on GSM8K vs StreetMath
  • ...and 2 more figures