A RelEntLess Benchmark for Modelling Graded Relations between Named Entities

Asahi Ushio; Jose Camacho Collados; Steven Schockaert

TL;DR

This work introduces RelEntLess, a dataset for evaluating how well models rank named-entity pairs by the degree to which they satisfy a graded relation, covering five relations. It frames the task as few-shot ranking, in which a model receives only a relation description and five prototypical examples, and evaluates a wide range of baselines, from embedding methods to prompted large language models. Results show that performance among LMs improves strongly with model size, but a consistent gap relative to human performance remains, with the best models achieving around 0.62 Spearman's rank correlation on average. The study demonstrates both the potential and the limitations of current LMs for encoding nuanced relational knowledge beyond traditional knowledge graphs, and points to directions for future research in graded relations and temporal reasoning.

Abstract

Relations such as "is influenced by", "is known for" or "is a competitor of" are inherently graded: we can rank entity pairs based on how well they satisfy these relations, but it is hard to draw a line between those pairs that satisfy them and those that do not. Such graded relations play a central role in many applications, yet they are typically not covered by existing Knowledge Graphs. In this paper, we consider the possibility of using Large Language Models (LLMs) to fill this gap. To this end, we introduce a new benchmark, in which entity pairs have to be ranked according to how much they satisfy a given graded relation. The task is formulated as a few-shot ranking problem, where models only have access to a description of the relation and five prototypical instances. We use the proposed benchmark to evaluate state-of-the-art relation embedding strategies as well as several recent LLMs, covering both publicly available LLMs and closed models such as GPT-4. Overall, we find a strong correlation between model size and performance, with smaller Language Models struggling to outperform a naive baseline. The results of the largest Flan-T5 and OPT models are remarkably strong, although a clear gap with human performance remains.
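The benchmark scores a model by comparing its ranking of entity pairs against gold ratings, and results are reported as Spearman's rank correlation. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function names and the example scores are hypothetical, and it assumes that a higher score means a pair satisfies the relation better.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: gold ratings and model scores for five entity pairs.
gold_ratings = [5.0, 4.0, 3.0, 2.0, 1.0]
model_scores = [0.9, 0.7, 0.8, 0.2, 0.1]
rho = spearman(gold_ratings, model_scores)
```

In this toy example the model swaps only the second and third pairs, so the correlation stays high. A model that reproduced the gold order exactly would score 1, and one that reversed it would score -1.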

Paper Structure (23 sections, 12 figures, 14 tables)

Introduction
Related Work
Benchmarks for Graded Relations
Language Models as Knowledge Bases
Dataset
First phase
Second phase
Third phase
Baselines
Human Performance
Embedding Models
Word Embedding.
RelBERT.
Language Models
Results
...and 8 more sections

Figures (12)

Figure 1: Spearman's rank correlation averaged over the five relation types, plotted against model size.
Figure 2: Spearman's rank correlation averaged over the five relation types for different numbers of prototypical examples. For the 1-shot and 3-shot settings, we report the correlation for each of the three individual runs.
Figure 3: Spearman's rank correlation for the competitor/rival of relation type, plotted against model size.
Figure 4: Spearman's rank correlation for the friend/ally of relation type, plotted against model size.
Figure 5: Spearman's rank correlation for the influenced by relation type, plotted against model size.
...and 7 more figures
