Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs

Shasha Zhou, Mingyu Huang, Jack Cole, Charles Britton, Ming Yin, Jan Wolber, Ke Li

TL;DR

This work tackles the problem of evaluating factuality in medical LLM outputs by introducing FAITH, an unsupervised, reference-free framework that grounds claims in a medical knowledge graph (KG). FAITH decomposes responses into atomic claims, maps entities via UMLS, finds short evidence paths, and computes per-claim and overall factuality scores with interpretable explanations. Empirical results show that FAITH correlates more strongly with clinician judgments than traditional NLP metrics or LLM judges, is robust to paraphrase, and can be used to safeguard deployments via Reject-to-Answer (RTA) and Retrieval-Augmented Generation (RAG), and that it also applies to medical summarization and MFV. The study also highlights dependencies on KG quality and claim extraction accuracy, suggesting future work to broaden KG coverage and improve extraction reliability.

Abstract

The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare, with strong capabilities in diverse medical tasks. Yet, deploying LLMs in high-stakes healthcare settings requires rigorous verification and validation to understand any potential harm. This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses. To ground this investigation, we introduce FAITH, a framework designed to systematically probe the strengths and limitations of this KG-based approach. FAITH operates without reference answers by decomposing responses into atomic claims, linking them to a medical KG, and scoring them based on evidence paths. Experiments on diverse medical tasks with human subjective evaluations demonstrate that KG-grounded evaluation achieves considerably higher correlations with clinician judgments and can effectively distinguish LLMs with varying capabilities. It is also robust to textual variances. The inherent explainability of its scoring can further help users understand and mitigate the limitations of current LLMs. We conclude that while limitations exist, leveraging KGs is a prominent direction for automated factuality assessment in healthcare.
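The pipeline described above (atomic claims, entity linking, evidence paths, score aggregation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph `TOY_KG`, the function names, the hop limit, and the scoring rule (inverse of the shortest path length, averaged over claims) are hypothetical and are not the paper's formula, which also takes edge semantics into account. Claims are assumed to be already extracted and already linked to KG nodes.

```python
from collections import deque
from statistics import mean

# Toy undirected KG with hypothetical entities and edges (not taken from UMLS).
TOY_KG = {
    "metformin": {"type 2 diabetes", "lactic acidosis"},
    "type 2 diabetes": {"metformin", "hyperglycemia"},
    "hyperglycemia": {"type 2 diabetes", "insulin"},
    "insulin": {"hyperglycemia"},
    "lactic acidosis": {"metformin"},
    "aspirin": set(),
}


def shortest_path_length(kg, source, target, max_hops=3):
    """Breadth-first search; returns the hop count, or None if no path
    of at most max_hops edges connects the two entities."""
    if source not in kg or target not in kg:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in kg[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def claim_score(kg, head, tail, max_hops=3):
    """Assumed scoring rule: shorter evidence paths give higher scores;
    a claim with no evidence path scores 0."""
    hops = shortest_path_length(kg, head, tail, max_hops)
    if hops is None:
        return 0.0
    return 1.0 / max(hops, 1)


def response_score(kg, claims, max_hops=3):
    """Assumed aggregation: mean of the per-claim scores."""
    scores = [claim_score(kg, head, tail, max_hops) for head, tail in claims]
    return mean(scores) if scores else 0.0


# Each claim is a (head entity, tail entity) pair already linked to the KG.
claims = [
    ("metformin", "type 2 diabetes"),  # direct edge
    ("metformin", "insulin"),          # three hops apart
    ("aspirin", "insulin"),            # no connecting path
]
per_claim = [claim_score(TOY_KG, head, tail) for head, tail in claims]
overall = response_score(TOY_KG, claims)
```

Because each claim keeps its own score and evidence path, the lowest-scoring claims can be surfaced as an explanation of why a response received a low overall score.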

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of FAITH. (a) Example medical content generated by an LLM. (b) FAITH processes structured claims relating different medical entities in the response, which are automatically extracted by an LLM. (c) Extracted claims are matched with nodes (maroon) in a medical KG. Paths (maroon) and intermediate nodes (blue) linking the entities are identified. (d) A factuality score for each claim is computed based on the path characteristics and edge semantics. The response score is aggregated from individual claim scores.
  • Figure 2: FAITH effectively distinguishes between LLMs and is robust to noise. This figure shows the mean factuality scores assigned by FAITH and baseline metrics to responses from five LLMs on four datasets: MedQA (a), MMLU (b), MS-AKT (e), and LiveQA (g). A reliable metric should assign distinguishable scores across different models. Panels c, d, f, and h display the corresponding coefficients of variation (CV) of these scores under noisy conditions, introduced by generating $10$ paraphrased versions per response.
  • Figure 3: FAITH exhibits the highest correlation with clinician judgments. a, Clinician evaluation scores for answers to $16$ questions generated by GPT-4o, Llama 3.1, and official answers. b, Scatter plots correlating clinician judgments with scores from FAITH. Linear regression fits are shown with $95\%$ confidence intervals. c, Pearson correlation coefficients ($\rho$) between clinicians and scores from FAITH and various baselines.
  • Figure 4: Explainability of FAITH via faithful error identification and LLM limitation analysis. a, Alignment between clinician-identified incorrect claims and FAITH's lowest-scoring claims in LLM responses, shown by a confusion matrix. b, Distribution of the top-5 most frequent KG relation types linked to incorrect claims in GPT-4o's responses, as identified by FAITH.
  • Figure 5: FAITH enhances LLM factuality via selective intervention. GPT-4o performance on MedQA using Reject-to-Answer (RTA) or Retrieval-Augmented Generation (RAG). Interventions triggered by either FAITH scores or a model uncertainty baseline. a, Question-answering accuracy. b, FAITH factuality scores for responses. Metrics plotted against the percentage (x-axis) of responses selected for intervention.
  • ...and 1 more figure
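Two quantities named in the captions for Figures 2 and 5 can be sketched in a few lines: the coefficient of variation (CV) of a metric's scores across paraphrases of one response, and the selection of a given percentage of responses for intervention. This is a hedged illustration only. The function names are hypothetical, the use of the population standard deviation is an assumption (the captions do not say which estimator is used), and picking the lowest-scoring responses first is an assumed selection rule.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """CV = standard deviation / mean of the scores a metric assigns to
    paraphrases of the same response; lower values mean a more stable metric.
    Assumption: population standard deviation."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return pstdev(scores) / mu


def select_for_intervention(scores, fraction):
    """Return the indices of the lowest-scoring responses, covering the
    requested fraction of all responses (assumed selection rule)."""
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]


# Hypothetical factuality scores for four responses.
scores = [0.9, 0.2, 0.5, 0.7]
flagged = select_for_intervention(scores, fraction=0.5)
stability = coefficient_of_variation([1.0, 3.0])
```

The flagged responses would then be handled by the chosen intervention, for example by declining to answer or by regenerating the answer with retrieved context.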