KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models
Yuyang Bai, Shangbin Feng, Vidhisha Balachandran, Zhaoxuan Tan, Shiqi Lou, Tianxing He, Yulia Tsvetkov
TL;DR
KGQuiz introduces a knowledge-intensive benchmark to study how well LLM-encoded knowledge generalizes across domains and task formats. It formalizes knowledge as triplets in knowledge graphs, $\mathcal{T}=\{(h,r,t)\}$, and defines five tasks of increasing complexity, from True-or-False to Open-Ended Text Generation, with diverse negative sampling and semantic-matching metrics. Evaluations across three knowledge domains and ten LLMs show that performance depends strongly on domain and task format: straightforward knowledge QA is comparatively easy, while domain-specific facts and multi-hop reasoning remain significantly harder. The benchmark provides a scalable, extensible testbed for diagnosing gaps in LLM knowledge and for guiding targeted improvements and KG-augmented approaches.
Abstract
Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.
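To make the triplet-based construction concrete, the following is a minimal sketch of how a true-or-false item could be derived from a knowledge-graph triplet $(h, r, t)$ by randomly corrupting the tail entity. It is an illustration under stated assumptions, not the KGQuiz implementation: the names `TOY_KG`, `corrupt_tail`, and `make_true_false_item` are hypothetical, the three facts are toy data, and uniform random tail replacement stands in for whichever negative sampling strategies the benchmark actually uses.

```python
import random
from typing import List, Optional, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)

# Toy knowledge graph for illustration only (not KGQuiz data).
TOY_KG: List[Triplet] = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Tokyo", "capital_of", "Japan"),
]


def corrupt_tail(triplet: Triplet, kg: List[Triplet],
                 rng: random.Random) -> Optional[Triplet]:
    """Replace the tail with another tail from the graph.

    Candidates that would form a known fact are filtered out, so the
    result is guaranteed not to appear in `kg`. Returns None when no
    valid replacement exists.
    """
    head, relation, _ = triplet
    known = set(kg)
    candidates = sorted(
        {t for (_, _, t) in kg if (head, relation, t) not in known}
    )
    if not candidates:
        return None
    return (head, relation, rng.choice(candidates))


def make_true_false_item(triplet: Triplet, kg: List[Triplet],
                         rng: random.Random) -> Tuple[Triplet, bool]:
    """Return (statement, label); about half the statements are corrupted."""
    if rng.random() < 0.5:
        negative = corrupt_tail(triplet, kg, rng)
        if negative is not None:
            return negative, False
    return triplet, True


if __name__ == "__main__":
    rng = random.Random(0)
    for fact in TOY_KG:
        statement, label = make_true_false_item(fact, TOY_KG, rng)
        print(statement, "->", label)
```

In this sketch the label is true exactly when the statement is an original triplet of the graph. Turning the statement into a natural-language prompt, and choosing harder negatives than uniform random tails, are left open here.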
