How to Leverage Digit Embeddings to Represent Numbers?

Jasivan Alex Sivakumar, Nafise Sadat Moosavi

TL;DR

This work tackles the persistent challenge of numerical reasoning in language models by proposing an explicit, mathematically grounded aggregation of digit embeddings to represent numbers. A weighted aggregation computes number embeddings from digits, designed to preserve single-digit compatibility, reflect base-10 positional magnitude, and avoid normalisation that would erase length differences. The authors implement two model-agnostic integration strategies—input-level aggregation via a dedicated [AGG] token and an auxiliary loss that shapes aggregated number representations—and evaluate on MAWPS and FERMAT datasets, showing improvements for smaller models and nuanced dependence on model size and integration method. The findings suggest that explicit digit aggregation can enhance number understanding without architectural changes or extensive pretraining, with clear avenues for future work including scalability to larger models and extensions to decimals.
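The three design goals named above (single-digit compatibility, base-10 positional weighting, no normalisation) can be illustrated with a small sketch. This summary does not give the weighting formula, so the weights below are an assumption made only for illustration: the most significant digit gets weight 1 and each later digit is scaled by a further factor of 1/10. The function name `aggregate_digits` and the random embedding table are likewise hypothetical.

```python
import numpy as np


def aggregate_digits(number: str, digit_emb: np.ndarray, base: float = 10.0) -> np.ndarray:
    """Weighted, unnormalised sum of digit embeddings.

    digit_emb has shape (10, d); row k is the embedding of digit k.
    ASSUMPTION: weights are base**-i for the i-th digit counted from the
    most significant one. This is an illustrative choice, not necessarily
    the weighting used in the paper.
    """
    digits = [int(ch) for ch in number]
    # Leading digit has weight 1, so a single-digit number maps to its own embedding.
    weights = np.array([base ** -i for i in range(len(digits))])
    # No division by the sum of weights: numbers of different length stay distinct.
    return weights @ digit_emb[digits]


rng = np.random.default_rng(0)
digit_emb = rng.normal(size=(10, 8))  # stand-in for a model's digit embeddings

single = aggregate_digits("7", digit_emb)
double = aggregate_digits("55", digit_emb)
```

With these assumed weights, `single` is exactly the embedding of the digit 7, while `double` equals 1.1 times the embedding of 5, so "5" and "55" remain distinguishable because nothing is renormalised.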

Abstract

Within numerical reasoning, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model, easy to implement, and have been made publicly available.


Paper Structure

This paper contains 20 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A 2D projection of the neighbourhood of the number token “55” in FLAN large is represented on the left. Ideally, number embeddings should reflect natural numerical proximity. In other words, the embedding for any given number should closely align with those of its immediate numerical neighbours, depicted on the right.
  • Figure 2: Average F1-score of FLAN large layer 1 numbers using sum and our weighted aggregation function with a neighbourhood of 10.
  • Figure 3: Average F1-score of FLAN large layer 1 numbers using max, min, median, mean, sum, and our weighted aggregation function with a neighbourhood of 10.
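
The captions above report an F1-score over a neighbourhood of 10 but do not say how it is computed. One plausible reading, sketched below as an assumption, compares the k nearest neighbours of a number in embedding space with its k numerically closest integers. The function name `neighbourhood_f1`, the use of cosine similarity, and the tie-breaking rule are all hypothetical choices.

```python
import numpy as np


def neighbourhood_f1(emb: np.ndarray, idx: int, k: int = 10) -> float:
    """F1 between embedding-space and numerical neighbourhoods of integer idx.

    emb has shape (n, d); row j is assumed to be the embedding of integer j.
    ASSUMPTION: neighbours are ranked by cosine similarity, and the gold set
    is the k integers closest to idx (ties broken towards smaller numbers).
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # a number is not its own neighbour
    predicted = set(np.argsort(-sims)[:k].tolist())

    others = [j for j in range(emb.shape[0]) if j != idx]
    gold = set(sorted(others, key=lambda j: (abs(j - idx), j))[:k])

    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy embeddings placed along an arc, so that cosine similarity falls off
# with numerical distance; this mimics the "ideal" layout on the right of Figure 1.
angles = 0.01 * np.arange(100)
ideal_emb = np.stack([np.cos(angles), np.sin(angles)], axis=1)
score = neighbourhood_f1(ideal_emb, idx=55, k=10)
```

On this toy layout the neighbours of 55 in embedding space are exactly the integers 50 to 60 (excluding 55), so the score is 1.0; embeddings that scatter numerical neighbours would score lower.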