LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models

Max Ploner; Jacek Wiland; Sebastian Pohl; Alan Akbik

TL;DR

The paper tackles the challenge of reliably evaluating relational knowledge stored in pre-trained LMs, addressing skew and interpretability issues of prior probes. It introduces LM-Pub-Quiz, an open-source framework built around the BEAR probe that standardizes data interfaces, evaluation, and analysis, and integrates with the Hugging Face ecosystem for both standalone evaluation and training-time monitoring. Key contributions include a three-object API (Dataset, Evaluator, DatasetResult), fine-grained analysis options (domain, cardinality, relation-level, per-instance), and a trainer-integrated callback for continual-learning contexts, along with a competitive leaderboard. The framework demonstrates domain-adaptation insights, biases, and forgetting dynamics across CL scenarios, offering a practical tool for researchers to quantify and compare relational knowledge across models and training regimes, with broad applicability to domain adaptation and continual learning research.

Abstract

Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a less biased reading of LM knowledge. With this paper, we present LM-PUB-QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely used training pipeline of the Hugging Face TRANSFORMERS library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-PUB-QUIZ as an open-source project.

Paper Structure (31 sections, 6 figures, 4 tables)

Introduction
Framework
Framework Overview
Setup
Interface
Dataset
Evaluator
DatasetResult
Direct Evaluation of a Trained LM
Monitoring Knowledge during Training
Analysis Options
Comparison with Existing Libraries
Example Experiments
Domain-specific Knowledge after Training on Different Corpora
Experimental Setup
...and 16 more sections

Figures (6)

Figure 1: The BEAR probe uses relational triplets from a knowledge base (KB) to construct multiple-choice items. Here, it leverages the knowledge that "Kampala" is the capital of "Uganda", while "Thimphu", "Buenos Aires" and "Bandar Seri Begawan" (other capital cities) are not. It measures whether the LM correctly ranks the verbalization of the true fact higher than the distractors.
Figure 2: Example screenshot from TensorBoard, showcasing the Hugging Face Trainer integration of LM-Pub-Quiz. Here, we monitor the knowledge of four roberta-base models (Liu et al., 2019), continuously pretrained on permutations of the Wikitext corpus.
Figure 3: Scores for different BEAR domains for models trained on different corpora.
Figure 4: Performance of the selected models on the P30 relation of the BEAR probe averaged over relation templates, including min and max values as bars.
Figure 5: Trajectories of the knowledge represented in a bert-base-cased model throughout the continual learning process as measured by LM-Pub-Quiz and [MASK]-predict on the T-REx dataset. Additionally, the performance of bert-base-cased evaluated on the BEAR probe is shown.
...and 1 more figures
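
The ranking idea in the Figure 1 caption can be illustrated with a small, self-contained sketch. This is not the LM-Pub-Quiz API: the function names, the `[X]`/`[Y]` template placeholders, and the toy scoring function are all assumptions made for illustration. A real setup would replace `toy_log_likelihood` with the (pseudo) log-likelihood that a causal or masked LM assigns to each verbalized statement.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical per-token log-probabilities standing in for a real LM.
# Any token not listed receives DEFAULT_LOGPROB.
TOY_TOKEN_LOGPROBS: Dict[str, float] = {"Kampala": -0.5}
DEFAULT_LOGPROB = -2.0


def toy_log_likelihood(statement: str) -> float:
    """Toy scorer (assumption): sum of per-token log-probabilities."""
    tokens = statement.rstrip(".").split()
    return sum(TOY_TOKEN_LOGPROBS.get(tok, DEFAULT_LOGPROB) for tok in tokens)


def rank_options(
    template: str,
    subject: str,
    options: Sequence[str],
    score: Callable[[str], float],
) -> Tuple[str, Dict[str, float]]:
    """Verbalize every answer option, score each statement, return the best."""
    scores = {
        option: score(template.replace("[X]", subject).replace("[Y]", option))
        for option in options
    }
    # The prediction is the option whose statement receives the highest score.
    best = max(scores, key=scores.get)
    return best, scores


# Multiple-choice item from the Figure 1 caption: one true answer, three distractors.
options = ["Kampala", "Thimphu", "Buenos Aires", "Bandar Seri Begawan"]
prediction, scores = rank_options(
    "The capital of [X] is [Y].", "Uganda", options, toy_log_likelihood
)
is_correct = prediction == "Kampala"
print(prediction, is_correct)
```

Because the item is scored by comparing whole statements rather than by predicting a single masked token, the same procedure applies to both causal and masked LMs, which is the property the abstract attributes to BEAR.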
