KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models

Yuyang Bai, Shangbin Feng, Vidhisha Balachandran, Zhaoxuan Tan, Shiqi Lou, Tianxing He, Yulia Tsvetkov

TL;DR

KGQuiz introduces a knowledge-intensive benchmark to study how well LLM-encoded knowledge generalizes across domains and task formats. It formalizes knowledge as triplets in knowledge graphs, $\mathcal{T}=\{(h,r,t)\}$, and defines five tasks of increasing complexity, from True-or-False to Open-Ended Text Generation, with diverse negative sampling and semantic-matching metrics. Evaluations across three knowledge domains and ten LLMs show that performance is highly dependent on domain and task format, with simple QA easier and domain-specific, multi-hop reasoning significantly harder. The benchmark provides a scalable, extensible testbed to diagnose gaps in LLM knowledge and guide targeted improvements and KG-augmented approaches.
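
To make the triplet formulation concrete, the following is a minimal sketch of how True-or-False and Multiple-Choice items could be derived from a set of triplets $(h, r, t)$. It is not the benchmark's actual construction code: the `Triplet` type, the function names, the prompt wording, the toy facts, and the uniform random tail-corruption strategy are all assumptions made for illustration, and the paper's own negative sampling methods may differ.

```python
import random
from typing import List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """A knowledge-graph fact (h, r, t). Hypothetical helper type."""
    head: str
    relation: str
    tail: str


def corrupt_tail(triplet: Triplet, triplets: List[Triplet], rng: random.Random) -> Triplet:
    """Assumed random negative sampling: swap in another tail entity,
    rejecting any replacement that happens to be a known true fact."""
    known = set(triplets)
    candidates = sorted({t.tail for t in triplets} - {triplet.tail})
    rng.shuffle(candidates)
    for tail in candidates:
        negative = Triplet(triplet.head, triplet.relation, tail)
        if negative not in known:
            return negative
    raise ValueError("no valid negative sample for this triplet")


def true_or_false_item(triplet: Triplet, triplets: List[Triplet],
                       rng: random.Random) -> Tuple[str, bool]:
    """Show either the true triplet or a corrupted one, with its gold label."""
    label = rng.random() < 0.5
    shown = triplet if label else corrupt_tail(triplet, triplets, rng)
    prompt = f'Is the statement "{shown.head} {shown.relation} {shown.tail}" True or False?'
    return prompt, label


def multiple_choice_item(triplet: Triplet, triplets: List[Triplet], rng: random.Random,
                         n_options: int = 4) -> Tuple[str, List[str], int]:
    """Ask for the tail entity, with distractors drawn from other tails."""
    known = set(triplets)
    candidates = sorted(
        {t.tail for t in triplets}
        - {triplet.tail}
        - {t.tail for t in triplets
           if Triplet(triplet.head, triplet.relation, t.tail) in known}
    )
    options = rng.sample(candidates, n_options - 1) + [triplet.tail]
    rng.shuffle(options)
    question = f"{triplet.head} {triplet.relation} ____?"
    return question, options, options.index(triplet.tail)


# Toy knowledge graph (illustrative facts, not taken from the benchmark).
KG = [
    Triplet("Paris", "is the capital of", "France"),
    Triplet("Berlin", "is the capital of", "Germany"),
    Triplet("Rome", "is the capital of", "Italy"),
    Triplet("Madrid", "is the capital of", "Spain"),
]

if __name__ == "__main__":
    rng = random.Random(0)
    print(true_or_false_item(KG[0], KG, rng))
    print(multiple_choice_item(KG[0], KG, rng))
```

Under these assumptions, the same triplet yields items of different difficulty depending on how the negatives are chosen, which is the kind of variation the benchmark's negative sampling comparison examines.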

Abstract

Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.

Paper Structure (57 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Overview of the KGQuiz Benchmark, featuring five knowledge-intensive tasks with increasing complexity. We illustrate the diverse tasks employed in KGQuiz to test large language models, highlighting the examples and corresponding natural language prompts used to examine their knowledge abilities across domains and contexts.
  • Figure 2: Model performance on Task 1: True-or-False. Larger LMs are better at judging factual correctness, while the same LM performs differently across varying knowledge domains.
  • Figure 3: LLM performance on Task 2: Multiple-Choice. Davinci and Turbo consistently outperform other models, indicating their superior knowledge abilities under the multiple-choice knowledge utilization format.
  • Figure 4: Performance on Task 1: True-or-False with varying negative sampling methods. The choice of negative sampling has a significant impact on the difficulty of the task.
  • Figure 5: Comparison of model performance across different question sampling methods. Models are evaluated on 1,000 Task 1: True-or-False questions and 1,000 Task 2: Multiple-Choice questions sampled via three different methods.
  • ...and 1 more figures