A RelEntLess Benchmark for Modelling Graded Relations between Named Entities

Asahi Ushio, Jose Camacho Collados, Steven Schockaert

TL;DR

This work introduces RelEntLess, a benchmark for evaluating how well models rank named-entity pairs by the degree to which they satisfy a graded relation, covering five relation types. The task is framed as few-shot ranking: models see only a description of the relation and five prototypical examples. The evaluation covers a wide range of baselines, from embedding methods to large language models with different prompts. Among LMs, performance increases strongly with model size, but a consistent gap to human performance remains, with the best models achieving a Spearman rank correlation of around 0.62 on average. The study shows both the potential and the limitations of current LMs for encoding nuanced relational knowledge beyond traditional knowledge graphs, and points to future research directions in graded relations and temporal reasoning.

Abstract

Relations such as "is influenced by", "is known for" or "is a competitor of" are inherently graded: we can rank entity pairs based on how well they satisfy these relations, but it is hard to draw a line between those pairs that satisfy them and those that do not. Such graded relations play a central role in many applications, yet they are typically not covered by existing Knowledge Graphs. In this paper, we consider the possibility of using Large Language Models (LLMs) to fill this gap. To this end, we introduce a new benchmark, in which entity pairs have to be ranked according to how much they satisfy a given graded relation. The task is formulated as a few-shot ranking problem, where models only have access to a description of the relation and five prototypical instances. We use the proposed benchmark to evaluate state-of-the-art relation embedding strategies as well as several recent LLMs, covering both publicly available LLMs and closed models such as GPT-4. Overall, we find a strong correlation between model size and performance, with smaller Language Models struggling to outperform a naive baseline. The results of the largest Flan-T5 and OPT models are remarkably strong, although a clear gap with human performance remains.
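To make the ranking evaluation concrete, the sketch below computes Spearman's rank correlation, the metric reported in the figures, between a reference ranking and a model's scores for a handful of entity pairs. It is only an illustration: the scores are made up, the function names `average_ranks` and `spearman` are hypothetical, and nothing here is taken from the benchmark's actual data or evaluation code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five entity pairs under one graded relation.
# Higher means the pair satisfies the relation more strongly.
human_scores = [4.8, 4.1, 3.0, 2.2, 1.1]  # hypothetical reference judgements
model_scores = [0.9, 0.7, 0.8, 0.3, 0.1]  # hypothetical model outputs

print(round(spearman(human_scores, model_scores), 3))  # prints 0.9
```

Here the model swaps the second and third pairs relative to the reference ranking, which lowers the correlation from 1.0 to 0.9. Averaging such per-relation correlations over the five relation types gives a single summary number of the kind quoted above.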

Paper Structure

This paper contains 23 sections, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: Spearman's rank correlation averaged over the five relation types, as a function of model size.
  • Figure 2: Spearman's rank correlation averaged over the five relation types for different numbers of prototypical examples. For the 1-shot and 3-shot settings, we report the correlation of each of the three individual runs.
  • Figure 3: Spearman's rank correlation for the "competitor/rival of" relation type, as a function of model size.
  • Figure 4: Spearman's rank correlation for the "friend/ally of" relation type, as a function of model size.
  • Figure 5: Spearman's rank correlation for the "influenced by" relation type, as a function of model size.
  • ...and 7 more figures