Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

Qiyao Wei, Edward Morrell, Lea Goetz, Mihaela van der Schaar

TL;DR

This work introduces Semantic-KG, a knowledge-graph-based framework for generating scalable, domain-agnostic semantic similarity benchmarks for evaluating LLM outputs. The pipeline has four stages (subgraph sampling, perturbation, response generation, and response validation) and uses KG perturbations to produce semantically varied statement pairs across four domains. The authors benchmark several semantic similarity methods, including LLM-as-a-judge, traditional NLP metrics, and embedding-based metrics, and show that performance depends on both the type of semantic variation and the domain, with no method consistently best. The dataset and code are released to enable reproducible, domain-specific evaluation of semantic similarity methods, which matters for deploying LLMs in high-stakes settings. Overall, Semantic-KG provides an automatically generated benchmark for probing semantic content beyond surface-level textual similarity, and its results can guide method selection and future research in semantic evaluation.

Abstract

Evaluating the open-form textual responses generated by Large Language Models (LLMs) typically requires measuring the semantic similarity of the response to a (human-generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical form rather than semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one of four sub-types. We generate benchmark datasets in four different domains (general knowledge, biomedicine, finance, biology) and conduct a comparative study of semantic similarity methods, including traditional natural language processing scores and LLM-as-a-judge predictions. We observe that both the sub-type of semantic variation and the domain of the benchmark affect the performance of semantic similarity methods, with no method being consistently superior. Our results have important implications for the use of LLM-as-a-judge in detecting the semantic content of text. Code is available at https://github.com/QiyaoWei/semantic-kg and the dataset is available at https://huggingface.co/datasets/QiyaoWei/Semantic-KG.
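
The pairing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy triples, the single `perturb_relation` perturbation, and the fixed string templates that stand in for the response-generation stage are all hypothetical, and the validation stage is omitted.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical templates standing in for the response-generation stage.
TEMPLATES = ["{h} {r} {t}.", "It is known that {h} {r} {t}."]


def perturb_relation(subgraph: List[Triple], new_relation: str,
                     rng: random.Random) -> List[Triple]:
    """Hypothetical perturbation: swap the relation on one random edge."""
    idx = rng.randrange(len(subgraph))
    head, _, tail = subgraph[idx]
    perturbed = list(subgraph)
    perturbed[idx] = (head, new_relation, tail)
    return perturbed


def render(subgraph: List[Triple], template: str) -> str:
    """Turn every triple into a sentence using one template."""
    return " ".join(template.format(h=h, r=r, t=t) for h, r, t in subgraph)


def make_pairs(subgraph: List[Triple], new_relation: str,
               seed: int = 0) -> List[Tuple[str, str, int]]:
    """Return one positive (label 1) and one negative (label 0) pair."""
    rng = random.Random(seed)
    perturbed = perturb_relation(subgraph, new_relation, rng)
    original_a = render(subgraph, TEMPLATES[0])
    original_b = render(subgraph, TEMPLATES[1])    # same facts, new wording
    perturbed_b = render(perturbed, TEMPLATES[1])  # one fact changed
    return [(original_a, original_b, 1), (original_a, perturbed_b, 0)]


# Toy subgraph, invented for illustration only.
subgraph = [("aspirin", "treats", "fever"), ("aspirin", "inhibits", "COX-1")]
pairs = make_pairs(subgraph, new_relation="causes")
for first, second, label in pairs:
    print(label, "|", first, "|", second)
```

Because the label comes from whether the underlying graph was changed, not from a human rating, the definition of "dissimilar" is explicit in this sketch: a pair is negative exactly when one statement was rendered from a perturbed subgraph.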

Paper Structure

This paper contains 43 sections, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Difference between semantic and syntactic variations. Two text samples that are syntactically different but semantically equivalent (top), and two that are syntactically similar but semantically different (bottom). ROUGE-1 and ROUGE-L scores are shown for each statement pair, highlighting the limitations of these methods in detecting semantic meaning.
  • Figure 2: Overview of the Semantic-KG Framework. Semantic-KG consists of 4 stages: 1) Sampling: A subgraph is sampled from a knowledge graph dataset, 2) Perturbation: The sampled subgraph is perturbed, 3) Generation: Textual statements are generated from the subgraph and perturbed subgraphs, 4) Validation: Statements are validated for correctness using reconstruction accuracy.
  • Figure 3: Overview of the Semantic-KG Task. Positive response pairs (top left) are generated by sampling 2 responses from the same subgraph. Negative response pairs (bottom left) are generated by sampling one response from the original subgraph and one from a perturbed subgraph. The model is tasked with predicting the label of each response pair.
  • Figure 4: Semantic Performance by Perturbation Type. Performance (F1 Score) of different semantic similarity models stratified by perturbation type. Error bars display Clopper-Pearson 95% confidence intervals.
  • Figure 5: Semantic Performance by Dataset. Performance (F1 Score) of different semantic similarity models stratified by dataset. Error bars display Clopper-Pearson 95% confidence intervals.
  • ...and 6 more figures
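
The limitation that Figure 1 points to can be reproduced with a small sketch of unigram-overlap scoring. This is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), not the reference ROUGE implementation, and the example sentences are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "aspirin reduces fever"
# Same meaning, different wording.
paraphrase = rouge1_f1(reference, "fever is lowered by aspirin")
# Opposite meaning, nearly identical wording.
contradiction = rouge1_f1(reference, "aspirin increases fever")

print(f"paraphrase:    {paraphrase:.3f}")     # 0.500
print(f"contradiction: {contradiction:.3f}")  # 0.667
```

In this toy case the contradictory statement scores higher than the paraphrase, because the score depends only on shared tokens and not on what the sentences assert.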