Exploring Internal Numeracy in Language Models: A Case Study on ALBERT

Ulme Wennberg; Gustav Eje Henter

TL;DR

This paper addresses how numeracy emerges in transformer language models trained purely on text. The authors propose a PCA-based probe that analyzes the uncontextualized embeddings of numerals and ordinals in ALBERT variants, revealing principal axes that encode numeric ordering and magnitude. They find that the primary axes reflect numeric value: digits and their word forms fall into separate clusters yet increase along a shared direction, and larger numbers show compressed spacing, suggesting logarithmic-like scaling. The work offers a direct embedding-level view of emergent numeracy and a general methodology for exploring internal numerical representations, with potential applications to quantitative reasoning in NLP tasks.

Abstract

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
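The probing step described in the abstract (collect the embeddings of number tokens, then apply PCA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the embeddings here are synthetic stand-ins generated with NumPy, whereas in practice the rows would be taken from the model's learned token-embedding table. The names `pca_project` and `toy_embeddings` are hypothetical.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project the rows of `embeddings` onto their top principal components."""
    # PCA requires mean-centered data.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# ASSUMPTION: synthetic stand-in for real token embeddings. Each "number"
# 0..20 is placed along one hidden direction, plus a little Gaussian noise.
rng = np.random.default_rng(0)
values = np.arange(0, 21)
direction = rng.normal(size=128)  # toy dimensionality
direction /= np.linalg.norm(direction)
toy_embeddings = np.outer(values, direction) + 0.1 * rng.normal(size=(21, 128))

projected = pca_project(toy_embeddings, n_components=2)
pc1 = projected[:, 0]

# The sign of a principal component is arbitrary, so compare magnitudes only.
corr = np.corrcoef(values, pc1)[0, 1]
print(f"|correlation| between value and PC1: {abs(corr):.3f}")
```

On this toy data the first component recovers the planted ordering almost exactly, because the ordering was built in. With real embeddings, whether such an ordering appears is the empirical question the paper investigates.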

Paper Structure (12 sections, 4 figures)

Introduction
Background
Experiments
Analysis Methodology
Numerical vs. Lexical Embedding
Numbers 1 Through 100
Representing Orders of Magnitude
Words for Ordinals
Discussion
Conclusion
Acknowledgments
Bibliographical References

Figures (4)

Figure 1: Visualization of the first two principal components of word embeddings for numbers zero through twenty and their textual counterparts in two ALBERT models.
Figure 2: The first and second PCA components for all numbers 1 to 100 in two different ALBERT models.
Figure 3: Orders-of-magnitude word embeddings visualized along the first PCA axis across eight ALBERT configurations. Axes have been affinely transformed so that the first and last embeddings line up vertically. The last row shows the concepts arranged on a logarithmic axis for comparison.
Figure 4: Visualization of ordinal term embeddings along the first PCA axis across eight ALBERT configurations. The axes have been affinely transformed so that the first and last embeddings line up vertically. The last row shows the concepts arranged on a logarithmic axis for comparison.
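The captions for Figures 3 and 4 mention an affine transformation that lines up the first and last embeddings across models, plus a logarithmic axis for comparison. A minimal sketch of one way to do this is below, assuming the transformation is a shift and rescale of the first-PCA-axis coordinates so that the first point maps to 0 and the last to 1. The helper `align_endpoints`, the choice of magnitudes, and the `toy_pc1` values are all hypothetical illustrations, not data from the paper.

```python
import numpy as np


def align_endpoints(coords):
    """Affinely map 1-D coordinates so the first maps to 0 and the last to 1."""
    coords = np.asarray(coords, dtype=float)
    span = coords[-1] - coords[0]
    if span == 0:
        raise ValueError("first and last coordinates coincide; cannot rescale")
    # Dividing by a negative span also flips the axis, which absorbs the
    # arbitrary sign of a principal component.
    return (coords - coords[0]) / span


# ASSUMPTION: an illustrative set of orders of magnitude.
magnitudes = np.array([1e0, 1e1, 1e2, 1e3, 1e6, 1e9])
log_reference = align_endpoints(np.log10(magnitudes))

# ASSUMPTION: made-up first-axis coordinates standing in for one model.
toy_pc1 = np.array([-3.1, -1.9, -0.8, 0.4, 3.0, 5.9])
aligned = align_endpoints(toy_pc1)

for m, a, r in zip(magnitudes, aligned, log_reference):
    print(f"{m:>13,.0f}  aligned={a:.3f}  log-reference={r:.3f}")
```

After alignment, rows from different models share endpoints, so any differences in the spacing of the intermediate points can be compared directly against the logarithmic reference row.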