Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models

Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, Xiang Ren

TL;DR

NumerSense asks whether pre-trained language models encode numerical commonsense, using a targeted probing task whose test set contains 3,145 masked-number probes. Standard models such as BERT and RoBERTa struggle to recall numerical facts, even after fine-tuning with distant supervision, and remain brittle under adversarial perturbations. The paper analyzes data properties, model behavior, and implications for open-domain QA, and argues for targeted pre-training and inductive biases that better capture numerical knowledge. It provides a benchmark and insights to guide future improvements in numerical reasoning for language models.

Abstract

Recent works show that pre-trained language models (PTLMs), such as BERT, possess certain commonsense and factual knowledge. They suggest that it is promising to use PTLMs as "neural knowledge bases" via predicting masked words. Surprisingly, we find that this may not work for numerical commonsense knowledge (e.g., a bird usually has two legs). In this paper, we investigate whether and to what extent we can induce numerical commonsense knowledge from PTLMs as well as the robustness of this process. To study this, we introduce a novel probing task with a diagnostic dataset, NumerSense, containing 13.6k masked-word-prediction probes (10.5k for fine-tuning and 3.1k for testing). Our analysis reveals that: (1) BERT and its stronger variant RoBERTa perform poorly on the diagnostic dataset prior to any fine-tuning; (2) fine-tuning with distant supervision brings some improvement; (3) the best supervised model still performs poorly as compared to human performance (54.06% vs 96.3% in accuracy).
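The masked-word-prediction probing described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The candidate list (`NUMBER_WORDS`), the `<mask>` token, the `toy_scorer` stand-in for a language model, and the second probe about car wheels are all assumptions made for illustration; only the bird example comes from the abstract.

```python
from typing import Callable, Dict, List, Tuple

# Assumed candidate vocabulary for the masked slot (illustrative, not from the source).
NUMBER_WORDS = ["no", "zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

# A scorer maps a probe sentence to scores for possible fillers of the mask.
Scorer = Callable[[str], Dict[str, float]]


def predict_number(probe: str, scorer: Scorer) -> str:
    """Return the highest-scoring number word for a probe containing '<mask>'."""
    if "<mask>" not in probe:
        raise ValueError("probe must contain a <mask> token")
    scores = scorer(probe)
    # Rank only number words; words the scorer did not return get -inf.
    return max(NUMBER_WORDS, key=lambda w: scores.get(w, float("-inf")))


def accuracy(probes: List[Tuple[str, str]], scorer: Scorer) -> float:
    """Fraction of probes whose top-ranked number word equals the truth number."""
    if not probes:
        return 0.0
    hits = sum(predict_number(text, scorer) == truth for text, truth in probes)
    return hits / len(probes)


def toy_scorer(probe: str) -> Dict[str, float]:
    """Hypothetical stand-in for a masked language model: always prefers 'four'."""
    return {"four": 0.6, "two": 0.3, "the": 0.9}


PROBES = [
    ("A bird usually has <mask> legs.", "two"),    # example from the abstract
    ("A car usually has <mask> wheels.", "four"),  # illustrative probe, not from the source
]

print(accuracy(PROBES, toy_scorer))  # 0.5
```

Restricting the ranking to number words means a high score for a non-number token such as "the" cannot count as a prediction. Replacing `toy_scorer` with a real masked language model would turn the sketch into an actual probe.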

Paper Structure

This paper contains 14 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Top: PTLMs often cannot solve masked language modeling tasks needing numerical commonsense knowledge, hence our title. Bottom: Even when PTLMs seemingly succeed, they fail to stay consistent under small perturbations.
  • Figure 1: NumerSense examples of each category.
  • Figure 2: Truth number distribution of the training set.
  • Figure 3: Truth number distribution of the test set.
  • Figure 3: The average Softmax of top 3 predictions in templates where '[x]' is filled with 1k random words.
  • ...and 2 more figures