Mitigating LLM Hallucinations with Knowledge Graphs: A Case Study

Harry Li, Gabriel Appleby, Kenneth Alperin, Steven R Gomez, Ashley Suh

TL;DR

The paper addresses hallucinations in LLMs when used for high-stakes cyber operations by grounding questions in a knowledge graph. It introduces LinkQ, a human-in-the-loop NL-to-KG querying system that forces the LLM to query a KG for ground-truth data before answering, thereby reducing hallucinations. Quantitative evaluation on Mintaka/Wikidata shows LinkQ outperforming GPT-4 in KG-query accuracy across most question types, while a qualitative BRON-based study with domain experts reveals practical usability insights and concrete feature requests. The findings suggest that KG-grounded QA with controlled query generation can enable safer, more trustworthy LLM-assisted decision support, though complex queries still demand advanced prompting strategies and user-in-the-loop mechanisms.

Abstract

High-stakes domains like cyber operations need responsible and trustworthy AI methods. While large language models (LLMs) are becoming increasingly popular in these domains, they still suffer from hallucinations. This research paper provides learning outcomes from a case study with LinkQ, an open-source natural language interface that was developed to combat hallucinations by forcing an LLM to query a knowledge graph (KG) for ground-truth data during question-answering (QA). We conduct a quantitative evaluation of LinkQ using a well-known KGQA dataset, showing that the system outperforms GPT-4 but still struggles with certain question categories - suggesting that alternative query construction strategies will need to be investigated in future LLM querying systems. We discuss a qualitative study of LinkQ with two domain experts using a real-world cybersecurity KG, outlining these experts' feedback, suggestions, perceived limitations, and future opportunities for systems like LinkQ.
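
The idea of forcing an LLM to answer from a knowledge graph rather than from its own parameters can be illustrated with a minimal sketch. This is not LinkQ's implementation: the toy triple store, the `(subject, predicate)` query form, and the names `stub_llm`, `validate_query`, `run_query`, and `answer_from_kg` are all hypothetical and chosen only for illustration. The sketch assumes the LLM's sole job is to propose a query, that the system checks the query's identifiers against the KG before running it, and that the final answer is built only from what the KG returns.

```python
from typing import Callable, Optional

# Toy knowledge graph as (subject, predicate, object) triples.
# Contents are illustrative only, not taken from Wikidata or BRON.
Triple = tuple[str, str, str]
Query = tuple[str, str]  # hypothetical query form: (subject, predicate)

TOY_KG: set[Triple] = {
    ("Ada Lovelace", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
}


def validate_query(kg: set[Triple], query: Query) -> bool:
    """Return True only if the query's identifiers actually exist in the KG.

    An unknown subject or predicate is treated as a possible hallucination.
    """
    subject, predicate = query
    subjects = {s for s, _, _ in kg}
    predicates = {p for _, p, _ in kg}
    return subject in subjects and predicate in predicates


def run_query(kg: set[Triple], query: Query) -> list[str]:
    """Return every object linked to the subject by the predicate."""
    subject, predicate = query
    return sorted(o for s, p, o in kg if s == subject and p == predicate)


def answer_from_kg(
    question: str,
    propose_query: Callable[[str], Optional[Query]],
    kg: set[Triple],
) -> str:
    """Answer using KG results only; the LLM never supplies the facts."""
    query = propose_query(question)
    if query is None or not validate_query(kg, query):
        return "No grounded answer: the proposed query did not match the knowledge graph."
    results = run_query(kg, query)
    if not results:
        return "No grounded answer: the query returned no results."
    return ", ".join(results)


def stub_llm(question: str) -> Optional[Query]:
    """Stand-in for an LLM call (assumption: a real system would prompt a model)."""
    if "born" in question.lower():
        return ("Ada Lovelace", "born_in")
    # Simulates a hallucinated predicate that is absent from the KG.
    return ("Ada Lovelace", "invented")


if __name__ == "__main__":
    print(answer_from_kg("Where was Ada Lovelace born?", stub_llm, TOY_KG))
    print(answer_from_kg("What did Ada Lovelace invent?", stub_llm, TOY_KG))
```

In this sketch a question whose proposed query uses an identifier missing from the KG is refused rather than answered, which is the behaviour the abstract describes at a high level; a human-in-the-loop system would instead surface the rejected query to the user for refinement.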

Paper Structure

This paper contains 7 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: An overview of LinkQ's human-in-the-loop system design [li2024linkq], described in the system design section. The LLM, System, and User each have their own responsibilities in the question-to-query translation: the LLM constructs queries, the System works to illuminate possible hallucinations, and the User iterates with the System and LLM to modify, refine, or follow up on generated queries.
  • Figure 2: Results from our quantitative evaluation (see the results section) comparing LinkQ (blue) and GPT-4 (orange). Left: LinkQ and GPT-4's overall question accuracy on 24 questions for each question type (x-axis), with a breakdown showing the number of correct attempts per question (y-axis). Right: LinkQ versus GPT-4's runtime to generate a corresponding query.