Laying Anchors: Semantically Priming Numerals in Language Modeling

Mandar Sharma, Rutuja Murlidhar Taware, Pravesh Koirala, Nikhil Muralidhar, Naren Ramakrishnan

TL;DR

The paper tackles the limited numeric comprehension of off-the-shelf language models by introducing anchor-based numeral grounding during pre-training. It derives anchors from numeral distributions in the training corpus using Gaussian Mixture Models and augments numerals with priming tokens to create linear and compressive representations, including directional cues. Across in-domain and out-of-domain numerals, the proposed Anchors, ln Anchors, and their directional variants substantially improve magnitude estimation and relative ordering, outperforming strong baselines like GenBERT and MWP-BERT and extending evaluation up to numerals as large as $10^{10}$. The approach offers a simple, plug-and-play enhancement with broad practical implications for numeracy in downstream NLP tasks, while acknowledging resource limitations and providing an ethics-aware evaluation.
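As a rough illustration of deriving anchors from a numeral distribution, the sketch below fits a one-dimensional Gaussian mixture with plain expectation-maximization and treats the component means as anchors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fit_gmm_1d`, the deterministic initialization, the iteration count, and the choice of component means as anchors are all hypothetical.

```python
import math


def fit_gmm_1d(values, k, iters=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns the sorted component means, used here as candidate anchors
    (an assumption of this sketch, not a detail confirmed by the paper).
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    # Deterministic init: spread the initial means across the sorted data.
    means = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    spread = (xs[-1] - xs[0]) / k or 1.0
    variances = [spread ** 2] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each numeral.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate mean, variance, and weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
            weights[j] = nj / n

    return sorted(means)


# Toy "corpus" of numerals clustered around 10 and 100 (made-up values).
corpus_numerals = [9, 10, 11, 10, 99, 100, 101, 100]
anchors = fit_gmm_1d(corpus_numerals, k=2)
```

On this toy input the two means settle near the two clusters. For a compressive variant, one option (again an assumption) is to fit the mixture on `math.log(v)` for each numeral instead of on `v` itself.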

Abstract

Off-the-shelf pre-trained language models have become the de facto standard in NLP pipelines for a multitude of downstream tasks. However, the inability of these models to properly encode numerals limits their performance on tasks requiring numeric comprehension. We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus, thereby enabling mathematically grounded representations of these numeral tokens. We establish the superiority of our proposed techniques through evaluation on a range of numeracy tasks for both in-domain (seen) and out-domain (unseen) numerals. Further, we expand our empirical evaluations to numerals ranging from 1 to 10 billion, a significantly broader range compared to previous studies of the same nature, and we demonstrate significant improvements in the mathematical grounding of our learned embeddings.

Paper Structure (17 sections, 1 equation, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Anchor-based embeddings correlate significantly better with the number line: The plot shows how well the numeral embeddings from the baselines and our model (Anchors) correlate with the number line, with their $R^{2}$ goodness-of-fit scores presented. The numeral range [1,10k] is employed for this plot as it contains a healthy mixture of both in-domain and out-domain numerals from our dataset.
  • Figure 2: How are the numerals in the training corpus primed? Samples from the training corpus are shown as-is, primed with simple anchors <ANC>, where each numeral in the sample is augmented with its closest anchor, and primed with directional anchors <LA>/<RA>, where the direction of the anchor with respect to the numeral (left or right on the number line) is also embedded.
  • Figure 3: Heatmaps computed from cosine similarities of numeral embeddings in range [1,100].
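
The priming described in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: the helper names `closest_anchor` and `prime`, the example anchors, and the exact token layout (numeral, then tag, then anchor) are assumptions, since the caption does not specify where the anchor is placed relative to the numeral. The sketch also assumes <LA> marks an anchor lying at or to the left of the numeral and <RA> an anchor lying to its right.

```python
import re

# Matches plain integers and decimals; a simplification for this sketch.
NUMERAL = re.compile(r"\d+(?:\.\d+)?")


def closest_anchor(value, anchors):
    """Return the anchor with the smallest absolute distance to value."""
    return min(anchors, key=lambda a: abs(a - value))


def prime(text, anchors, directional=False):
    """Augment every numeral in text with its closest anchor.

    The layout "numeral <TAG> anchor" is a hypothetical choice.
    """
    def repl(match):
        token = match.group(0)
        value = float(token)
        anchor = closest_anchor(value, anchors)
        if not directional:
            tag = "<ANC>"
        else:
            # Assumed convention: <LA> if the anchor is left of the numeral.
            tag = "<LA>" if anchor <= value else "<RA>"
        return f"{token} {tag} {anchor}"

    return NUMERAL.sub(repl, text)


# Hypothetical anchors; in practice they would come from the corpus.
example_anchors = [10, 100, 1000]
sample = "sold 12 units for 850 dollars"
simple = prime(sample, example_anchors)
directed = prime(sample, example_anchors, directional=True)
```

Here `simple` is `"sold 12 <ANC> 10 units for 850 <ANC> 1000 dollars"` and `directed` is `"sold 12 <LA> 10 units for 850 <RA> 1000 dollars"`.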