KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques

Rui Yang, Haoran Liu, Edison Marrese-Taylor, Qingcheng Zeng, Yu He Ke, Wanxin Li, Lechao Cheng, Qingyu Chen, James Caverlee, Yutaka Matsuo, Irene Li

TL;DR

The paper addresses factual inconsistency in large language models (LLMs) answering medical questions by augmenting them with a structured medical knowledge graph (KG) and several ranking steps. It introduces KG-Rank, which extracts medical entities from a question, retrieves one-hop triples from the UMLS knowledge graph, ranks them with Similarity, Answer Expansion, and Maximal Marginal Relevance (MMR) techniques, and then re-ranks them with MedCPT before the LLM generates a long-form answer. On four medical QA datasets, KG-Rank improves ROUGE-L by over 18%; extended to open-domain QA (law, business, music, and history), it improves ROUGE-L by 14%, which suggests the approach transfers beyond medicine.

Abstract

Large language models (LLMs) have demonstrated impressive generative capabilities with the potential to innovate in medicine. However, the application of LLMs in real clinical settings remains challenging due to the lack of factual consistency in the generated content. In this work, we develop an augmented LLM framework, KG-Rank, which leverages a medical knowledge graph (KG) along with ranking and re-ranking techniques, to improve the factuality of long-form question answering (QA) in the medical domain. Specifically, when receiving a question, KG-Rank automatically identifies medical entities within the question and retrieves the related triples from the medical KG to gather factual information. Subsequently, KG-Rank innovatively applies multiple ranking techniques to refine the ordering of these triples, providing more relevant and precise information for LLM inference. To the best of our knowledge, KG-Rank is the first application of KG combined with ranking models in medical QA specifically for generating long answers. Evaluation on four selected medical QA datasets demonstrates that KG-Rank achieves an improvement of over 18% in ROUGE-L score. Additionally, we extend KG-Rank to open domains, including law, business, music, and history, where it realizes a 14% improvement in ROUGE-L score, indicating the effectiveness and great potential of KG-Rank.
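
To make the ranking step concrete, here is a minimal sketch of MMR ranking over retrieved triples. It is an illustration under stated assumptions, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for whatever encoder the paper uses, and the names `mmr_rank`, `k`, and `lam`, as well as the example triples, are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rank(question, triples, k=3, lam=0.7):
    """Greedy MMR: prefer triples relevant to the question, penalising
    similarity to triples that were already selected."""
    q_vec = embed(question)
    vecs = [embed(" ".join(t)) for t in triples]
    selected = []                      # indices chosen so far, in rank order
    candidates = list(range(len(triples)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(q_vec, vecs[i])
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [triples[i] for i in selected]


if __name__ == "__main__":
    # Hypothetical one-hop triples (head, relation, tail); not real UMLS output.
    triples = [
        ("migraine", "treated by", "sumatriptan"),
        ("migraine", "treated by", "sumatriptan"),  # exact duplicate
        ("headache", "symptom of", "migraine"),
        ("aspirin", "is a", "drug"),
    ]
    for triple in mmr_rank("what treats migraine headache", triples):
        print(triple)
```

In this toy run the duplicate triple is selected only once, because its redundancy penalty outweighs its relevance after the first copy is chosen. The weight `lam` trades relevance against diversity; the value 0.7 is an arbitrary choice for the sketch.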

Paper Structure

This paper contains 26 sections, 3 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: An illustration of KG-Rank Framework.
  • Figure 2: BERTScore comparison: zero-shot setting with LLaMa2-7b and Baize-Healthcare. Ep stands for ExpertQA.
  • Figure 3: A case study from ExpertQA-Med: results from LLaMa2-13b and with KG-Rank.
  • Figure 4: Prompt used to extract medical terminologies.
  • Figure 5: Prompt for answer expansion ranking technique.
  • ...and 4 more figures