A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs

Bradley P. Allen; Paul T. Groth

TL;DR

The paper asks whether metalinguistic disagreements (disputes about the meaning of language rather than about facts) arise when LLMs are used alongside knowledge graphs for fact-checking. It argues that such disagreements can complicate KG engineering and introduces a benchmark concept for detecting disagreements and distinguishing factual from metalinguistic ones, using the T-REx dataset as a proof of concept. Initial evidence from a small-scale experiment with 250 T-REx-aligned triples across nine LLMs shows non-negligible rates of metalinguistic disagreement and highlights cases where the disagreement centers on the meaning of a predicate rather than on facts about the world. The authors propose a more comprehensive, human-annotated benchmark with multiple KG sources and contextual cues, designed to improve evaluation validity and guide ontology and prompt-design strategies in knowledge-grounded NLP applications.

Abstract

Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on GitHub.
