LLMs Know More About Numbers than They Can Say

Fengting Yuchi, Li Du, Jason Eisner

TL;DR

The paper investigates whether LLMs truly understand numeric magnitudes across decimal and scientific notation and whether this understanding translates into verbalizable answers. Using linear probes, the authors show that internal representations encode $\log_2$ magnitudes and can recover numeral values, while a separate classifier can internally compare numbers; however, verbalization of cross-notation comparisons is only 50–70% accurate for open 7B–8B models. Finetuning that couples the LM objective with a probing loss improves verbalized numeracy by about 3.22%, suggesting a causal link between richer internal magnitude representations and generation quality. These results imply a practical path to enhance numeracy in LLMs by targeting internal representations during training, with potential benefits for scientific and numerical reasoning tasks.
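One way to read "couples the LM objective with a probing loss" is as a weighted sum of two terms. The form below is an illustrative assumption, not the paper's stated equation: the weight $\lambda$, the probe parameters $w$ and $c$, and the choice of hidden state $h$ (taken here to be the state after the model has read the pair of numerals $a$, $b$) are all hypothetical names introduced for this sketch.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{probe}},
\qquad
\mathcal{L}_{\text{probe}} = -\bigl[\, y \log \sigma(w^\top h + c) + (1 - y) \log\bigl(1 - \sigma(w^\top h + c)\bigr) \bigr],
$$

where $y = 1$ if $a > b$ and $y = 0$ otherwise, and $\sigma$ is the logistic function. Under this reading, the gradient of $\mathcal{L}_{\text{probe}}$ flows into the model's hidden states, so finetuning is rewarded for making the comparison linearly decodable as well as for predicting the next token.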

Abstract

Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.
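The regression-probe idea in the abstract can be sketched on toy data. The example below does not use a real LLM: it fabricates "hidden states" that encode $\log_2$ magnitude along one hypothetical direction plus noise, fits a single linear projection by least squares, and measures the relative error of the recovered numbers. The dimensions, noise level, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def make_synthetic_states(log_mag, direction, rng, noise=0.05):
    """Toy stand-in for LLM hidden states: log2 magnitude along one direction, plus noise."""
    clean = np.outer(log_mag, direction)
    return clean + noise * rng.normal(size=clean.shape)


def add_bias(states):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([states, np.ones((len(states), 1))])


def fit_linear_probe(states, targets):
    """Least-squares fit of one linear projection from states to log2 magnitudes."""
    weights, *_ = np.linalg.lstsq(add_bias(states), targets, rcond=None)
    return weights


def predict_log2(weights, states):
    """Apply the fitted probe to new states."""
    return add_bias(states) @ weights


rng = np.random.default_rng(0)
dim = 16                                  # hypothetical hidden size
direction = rng.normal(size=dim)          # hypothetical "magnitude direction"

train_y = rng.uniform(-10.0, 10.0, size=2000)   # true log2 magnitudes
test_y = rng.uniform(-10.0, 10.0, size=500)
train_h = make_synthetic_states(train_y, direction, rng)
test_h = make_synthetic_states(test_y, direction, rng)

weights = fit_linear_probe(train_h, train_y)
pred_y = predict_log2(weights, test_h)

# Relative error of the recovered number: |2^pred - 2^true| / 2^true = |2^(pred - true) - 1|
rel_err = np.abs(np.exp2(pred_y - test_y) - 1.0)
mean_rel_err = float(rel_err.mean())
print(f"mean relative error on toy data: {mean_rel_err:.4f}")
```

Because the toy states are linear in the log-magnitude by construction, the probe recovers the numbers almost perfectly; the paper's reported errors come from real hidden states, where no such guarantee holds.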

Paper Structure (32 sections, 4 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Scatterplots of predicted vs. true ("golden") log-magnitudes for Mistral-7B (32 layers) across different datasets and notation types. "Synth" refers to the synthetic cross-notation data we constructed in the experimental-setup section. Dec/sci/mixed conditions train and test on decimals only, scientific notation numerals only, and both, respectively. We train and test a probe at each layer, and then plot performance on held-out test data for only the probe that achieved the highest $R^2$ on held-out validation data. The parenthesized number is the layer index of that probe.
  • Figure 2: Log-space MSE of the regression probes on cross-notation data, for each LLM across layers.
  • Figure 3: Accuracy of cross-notation comparison ($a \stackrel{?}{>} b$) versus the relative magnitude of the two numbers ($\log_2(a/b)$). (a) Verbalized comparison using one-shot prompting; (b) Comparison of the two values predicted by regression; (c) Comparison via a classification probe; (d) Comparison via the log-ratio predicted by regression. Note that (b)–(d) require access to the hidden states, so they do not include the large closed-source models GPT-4.1 and GPT-4.1-mini. See the per-model log-ratio figure for individual models' results.
  • Figure 4: Logistic classifier accuracy on cross-notation data, for each LLM across layers.
  • Figure 5: Average performance of linear regression probes at the first 3 layers correlates with the model's verbalization accuracy on cross-notation data.
  • ...and 9 more figures