Certifying Knowledge Comprehension in LLMs

Isha Chaudhary, Vedaant V. Jain, Gagandeep Singh

TL;DR

The paper tackles the reliability and generalization limits of LLMs on knowledge-comprehension tasks by replacing static benchmarks with specification-driven probability guarantees. It introduces LLMCert-C, a probabilistic certification framework that builds knowledge-graph-based specifications (over PrimeKG and Wikidata5m) incorporating natural noise and multi-hop reasoning, and then estimates the probability $p$ of correct answers using $n$ i.i.d. samples to produce Clopper-Pearson intervals with confidence $1-\delta$. The framework operates in a black-box setting, enabling certification of API-based or closed-source models, and delivers rigorous bounds $[p_l, p_u]$ to compare SOTA models while accounting for prompt noise. Experimental results show vulnerabilities to distractors, establish performance hierarchies across models, and reveal potential degradation due to quantization, underscoring the need for principled, guarantee-driven evaluation in safety-critical deployments. The work provides open-source tooling and demonstrates that specification-based certification can supplant unreliable benchmarking in high-stakes QA tasks.
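
The interval step mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the standard two-sided Clopper-Pearson construction, where $k$ correct answers out of $n$ i.i.d. samples give bounds $[p_l, p_u]$ at confidence $1-\delta$. The function names (`binom_cdf`, `bisect_root`, `clopper_pearson`) and the counts in the usage line are hypothetical and not taken from the paper's tooling; the bounds are found by bisection on the binomial tail using only the standard library.

```python
import math
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P[X <= k] for X ~ Binomial(n, p); each term is computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def bisect_root(f: Callable[[float], float], target: float,
                increasing: bool, iters: int = 100) -> float:
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, delta: float = 0.05) -> Tuple[float, float]:
    """Two-sided Clopper-Pearson interval [p_l, p_u] at confidence 1 - delta."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # p_l solves P[X >= k | p] = delta / 2 (this tail is increasing in p).
    p_l = 0.0 if k == 0 else bisect_root(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), delta / 2, increasing=True)
    # p_u solves P[X <= k | p] = delta / 2 (this tail is decreasing in p).
    p_u = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p), delta / 2, increasing=False)
    return p_l, p_u


# Hypothetical usage: 212 correct responses out of 250 sampled prompts.
p_l, p_u = clopper_pearson(212, 250, delta=0.05)
print(f"[{p_l:.3f}, {p_u:.3f}]")
```

Because the construction inverts exact binomial tails rather than relying on a normal approximation, the interval stays valid for small $n$ and for accuracies near 0 or 1, at the cost of being somewhat conservative.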

Abstract

Large Language Models (LLMs) are increasingly deployed in safety-critical systems where they provide answers based on in-context information derived from knowledge bases. As LLMs are increasingly envisioned as superhuman agents, their proficiency in knowledge comprehension (extracting relevant information and reasoning over it to answer questions, a key facet of human intelligence) becomes crucial. However, existing evaluations of LLMs on knowledge comprehension are typically conducted on small test sets that represent only a tiny fraction of the vast number of possible queries. Simple empirical evaluations on these limited test sets raise concerns about the reliability and generalizability of the results. In this work, we introduce the first specification and certification framework for knowledge comprehension in LLMs, providing formal probabilistic guarantees for reliability. Instead of a fixed dataset, we design novel specifications that mathematically represent prohibitively large probability distributions of knowledge comprehension prompts with natural noise, using knowledge graphs. From these specifications, we generate quantitative certificates that offer high-confidence, tight bounds on the probability that a given LLM correctly answers any question drawn from the specification distribution. We apply our framework to certify SOTA LLMs in two domains: precision medicine and general question-answering. Our results reveal previously unrecognized vulnerabilities in SOTA LLMs due to natural noise in the prompts. Additionally, we establish performance hierarchies with formal guarantees among the SOTA LLMs, particularly in the context of precision medicine question-answering.
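
To make the idea of a specification as a distribution over prompts concrete, here is a heavily simplified sketch. Everything in it is an assumption for illustration: the toy graph `TOY_KG`, the placeholder contexts, the query wording, and the helper names `sample_path` and `sample_prompt` are hypothetical and do not reproduce the paper's specifications, which are built over PrimeKG and Wikidata5m. The sketch samples a multi-hop path from a fixed start node, adds one distractor context from a neighbouring node that is off the path, and returns the prompt with its expected answer.

```python
import random

# Hypothetical toy KG: node -> list of (relation, neighbour) edges.
TOY_KG = {
    "A": [("r1", "B"), ("r2", "C")],
    "B": [("r3", "D")],
    "C": [("r4", "E")],
    "D": [],
    "E": [],
}
# Placeholder context text per node (a real KG would supply descriptions).
CONTEXT = {node: f"Context paragraph about {node}." for node in TOY_KG}


def sample_path(kg, start, max_hops, rng):
    """Random walk of at most max_hops edges; returns [(node, relation, next)]."""
    path, node = [], start
    for _ in range(max_hops):
        edges = kg[node]
        if not edges:
            break
        relation, nxt = rng.choice(edges)
        path.append((node, relation, nxt))
        node = nxt
    return path


def sample_prompt(kg, start, max_hops, rng):
    """Draw one (prompt, answer) pair: path contexts, one distractor, a query."""
    path = sample_path(kg, start, max_hops, rng)
    on_path = [start] + [nxt for _, _, nxt in path]
    # Distractor candidates: neighbours of path nodes that are not on the path.
    candidates = [nbr for node in on_path for _, nbr in kg[node]
                  if nbr not in on_path]
    contexts = [CONTEXT[node] for node in on_path]
    if candidates:
        contexts.append(CONTEXT[rng.choice(candidates)])
    rng.shuffle(contexts)  # context order is itself a source of prompt noise
    relations = " -> ".join(relation for _, relation, _ in path)
    query = f"Starting from {start}, which entity is reached via {relations}?"
    return "\n".join(contexts) + "\n" + query, on_path[-1]


rng = random.Random(0)
prompt, answer = sample_prompt(TOY_KG, "A", max_hops=2, rng=rng)
print(prompt)
print("expected answer:", answer)
```

Drawing many such prompts independently and checking the model's response against the expected answer yields the count of correct responses from which a confidence interval on the success probability is computed.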

Paper Structure (46 sections, 4 equations, 14 figures, 4 tables, 8 algorithms)

Figures (14)

  • Figure 1: Overview of LLMCert-C. (a) A KG ${\mathcal{G}}$. (b) A randomly chosen path DAG ${\mathcal{G}}'$ originating at the fixed node 'Paracetamol Overdose', one of the many possible DAGs originating at that node in ${\mathcal{G}}$. (c) A prompt created from ${\mathcal{G}}'$, containing the contexts of the nodes in ${\mathcal{G}}'$, a distractor context (highlighted in orange, as the node for 'Vomiting' is a distractor for ${\mathcal{G}}'$), and a query from ${\mathcal{G}}'$. (d) The target LLM's response to the prompt, validated against the correct answer. (e) The certifier obtains bounds on the probability of a correct response using $n$ samples of LLM responses to randomly sampled prompts.
  • Figure 2: Samples from PrimeKG's Distractor specifications. Prompt context not shown for brevity. Wrong/correct responses are colored red/green respectively. Samples are consistent with certificates, where Phi-3 (14B) has lower bounds than GPT-4o and Gemini-Pro.
  • Figure 3: Negative effects of distractors on LLM performance for answering queries from same DAGs (correct and wrong answers in green and red respectively).
  • Figure 4: A subgraph of Wikidata5m
  • Figure 5: Variation in the proportion of Clopper-Pearson confidence intervals containing the true probability of correct response
  • ...and 9 more figures

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3