BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models

Jacek Wiland, Max Ploner, Alan Akbik

TL;DR

BEAR is a log-likelihood-based probing framework that unifies relational knowledge evaluation across masked and causal LMs. It addresses limitations of the LAMA probe: single-token answers, dependence on masked language modeling, and data biases. The BEAR dataset is large and balanced, covering 60 relations and 7,731 instances (a larger variant, BEARbig, contains 40,916). For each instance, BEAR builds one statement per candidate answer and ranks the statements by the (pseudo) log-likelihood the LM assigns to them, which enables fair comparisons across LM types. Evaluations of 22 LMs show that model size correlates with BEAR performance and that masked LMs often perform slightly better than causal LMs of similar scale, while template choice and pre-training data also influence the scores. The datasets and an open-source framework are released to support the evaluation and development of both causal and masked LMs.

Abstract

Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable to either masked or causal LMs, but not both. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs.
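
The scoring idea in the abstract can be written down compactly. The formulas below are the standard definitions of sequence log-likelihood (causal LMs) and pseudo log-likelihood (masked LMs); we assume, without confirmation from this page, that they match the paper's own equations. The symbols s, w_t, and score are our notation, not necessarily the authors'.

```latex
% Causal LM: exact log-likelihood of a statement s = (w_1, ..., w_n)
\log P(s) = \sum_{t=1}^{n} \log P(w_t \mid w_1, \dots, w_{t-1})

% Masked LM: pseudo log-likelihood, masking one token at a time
% (s_{\setminus t} is the statement with token w_t masked out)
\mathrm{PLL}(s) = \sum_{t=1}^{n} \log P(w_t \mid s_{\setminus t})

% Prediction for a fact with candidate statements s_1, ..., s_k,
% where score is \log P for causal LMs and PLL for masked LMs
\hat{\imath} = \arg\max_{i \in \{1, \dots, k\}} \mathrm{score}(s_i)
```

An instance counts as correct when the selected index is that of the true statement.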

Paper Structure (49 sections, 4 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Comparison of the LAMA and BEAR probes. Both probes query LMs given a template (here in black), the subject of the relation (blue), and the object (orange). LAMA masks the object and predicts a single token as the answer. In BEAR, we create separate textual statements for a set of potential answers and select the statement with the highest (pseudo) log-likelihood as assigned by the LM. This method allows us to include multi-token answers and evaluate causal and masked LMs.
  • Figure 2: The normalized answer frequency of selected relations in the LAMA probe. The outliers are marked with dots. In some relations, a majority class accounts for more than 50% of all instances.
  • Figure 3: For each answer option, a sentence is passed to the LM (here using the template: "The capital of [X] is [Y]." and the subject "Uganda"). The log-likelihood scores assigned by the LM are then used to rank the answer options.
  • Figure 4: Probing scores of different models on BEAR. Model size is represented on a log scale.
  • Figure 5: Comparative analysis of model performance on identical subsets of relations and templates in the T-REx (LAMA) and BEAR datasets using the log-likelihood based evaluation.
  • ...and 9 more figures
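
The ranking procedure described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the released BEAR framework: the function names (`fill_template`, `rank_answers`, `make_toy_trigram_scorer`) are hypothetical, and the scorer is a toy add-one-smoothed trigram model standing in for a real LM. With an actual model, `log_likelihood` would instead return the summed token log-probabilities (causal LM) or the pseudo log-likelihood (masked LM) of the statement.

```python
import math
from collections import Counter


def fill_template(template, subject, answer):
    """Insert subject and answer into a template such as 'The capital of [X] is [Y].'"""
    return template.replace("[X]", subject).replace("[Y]", answer)


def rank_answers(template, subject, candidates, log_likelihood):
    """Score one statement per candidate and sort from most to least likely."""
    scored = [
        (answer, log_likelihood(fill_template(template, subject, answer)))
        for answer in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def make_toy_trigram_scorer(corpus):
    """Toy stand-in for an LM (an assumption for this sketch only):
    add-one-smoothed trigram log-likelihood over whitespace tokens."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    vocab_size = len(vocab)

    def log_likelihood(statement):
        tokens = ["<s>", "<s>"] + statement.lower().split()
        total = 0.0
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            # Counter returns 0 for unseen trigrams and contexts.
            total += math.log(
                (trigrams[(a, b, c)] + 1) / (contexts[(a, b)] + vocab_size)
            )
        return total

    return log_likelihood


# Tiny made-up "pre-training" corpus, for illustration only.
corpus = [
    "The capital of Uganda is Kampala.",
    "The capital of Kenya is Nairobi.",
    "The capital of Ethiopia is Addis Ababa.",
    "The capital of Thailand is Bangkok.",
]
scorer = make_toy_trigram_scorer(corpus)

ranking = rank_answers(
    "The capital of [X] is [Y].",
    "Uganda",
    ["Nairobi", "Addis Ababa", "Kampala", "Bangkok"],
    scorer,
)
print(ranking[0][0])  # Kampala
```

Because whole statements are scored rather than a single masked position, multi-token answers such as "Addis Ababa" need no special handling. Note that a summed log-likelihood adds one term per token, so longer candidates tend to score lower in this sketch.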