MedG-KRP: Medical Graph Knowledge Representation Probing

Gabriel R. Rosenbaum, Lavender Yao Jiang, Ivaxi Sheth, Jaden Stryker, Anton Alyakin, Daniel Alexander Alber, Nicolas K. Goff, Young Joon Fred Kwon, John Markert, Mustafa Nasir-Moin, Jan Moritz Niehues, Karl L. Sangwon, Eunice Yang, Eric Karl Oermann

TL;DR

MedG-KRP introduces a knowledge-graph-based probing framework to map the medical reasoning embedded in LLMs. Starting from a single medical concept, it generates a complete causal graph in two stages: node expansion, then edge refinement. The resulting graphs are evaluated both by human reviewers and against BIOS, a large biomedical KG used as ground truth. The study compares GPT-4, Llama3-70b, and PalmyraMed-70b and finds a trade-off between human-rated accuracy and ground-truth precision/recall. It also suggests that generalist models may cover broader concepts, while medically tuned models can be more specific but risk mislabeling causality. The framework offers a path toward transparent, explainable clinical reasoning and potential KG repair; future work focuses on improving causal inference and prompting strategies for safe clinical deployment.

Abstract

Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications. LLMs' ability to coalesce vast amounts of information from many sources to generate a response, a process similar to that of a human expert, has led many to see potential in deploying LLMs for clinical use. However, medicine is a setting where accurate reasoning is paramount. Many researchers are questioning the effectiveness of the multiple choice question answering (MCQA) benchmarks frequently used to test LLMs. Researchers and clinicians alike must have complete confidence in LLMs' abilities for them to be deployed in a medical setting. To address this need for understanding, we introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs. Essentially, we map how LLMs link medical concepts in order to better understand how they reason. We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model. We enlist a panel of medical students to review a total of 60 LLM-generated graphs and compare these graphs to BIOS, a large biomedical KG. We observe that GPT-4 performs best in our human review but worst in our ground truth comparison; the reverse holds for PalmyraMed, the medical model. Our work provides a means of visualizing the medical reasoning pathways of LLMs so they can be implemented in clinical settings safely and effectively.
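
To make the ground-truth comparison mentioned above more concrete, the following is a minimal sketch of edge-level precision and recall between an LLM-generated causal graph and a reference KG. It assumes exact string matching on directed (cause, effect) pairs. The function name `edge_precision_recall` and the toy edges are hypothetical illustrations, not data or code from the paper, and this section does not describe how the authors actually match concepts to BIOS.

```python
from typing import Set, Tuple

# A directed causal edge, written as (cause, effect).
Edge = Tuple[str, str]


def edge_precision_recall(
    generated: Set[Edge], reference: Set[Edge]
) -> Tuple[float, float]:
    """Return (precision, recall) of generated edges against reference edges.

    Assumption: two edges match only if both endpoint strings are identical
    and the direction is the same. Empty inputs yield 0.0 rather than an error.
    """
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy graphs, for illustration only.
generated_edges = {
    ("smoking", "lung cancer"),
    ("lung cancer", "weight loss"),
    ("coffee", "lung cancer"),
}
reference_edges = {
    ("smoking", "lung cancer"),
    ("lung cancer", "weight loss"),
    ("asbestos exposure", "lung cancer"),
    ("lung cancer", "hemoptysis"),
}

precision, recall = edge_precision_recall(generated_edges, reference_edges)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three generated edges appear in the reference set, and two of the four reference edges are recovered. A graph can therefore look plausible to a human reviewer and still score low on recall if the reference KG contains many edges the model did not produce.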

Paper Structure

This paper contains 40 sections, 11 tables, 3 algorithms.