Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Eden Avnat; Michal Levy; Daniel Herstain; Elia Yanko; Daniel Ben Joya; Michal Tzuchman Katz; Dafna Eshel; Sahar Laros; Yael Dagan; Shahar Barami; Joseph Mermelstein; Shahar Ovadia; Noam Shomron; Varda Shalev; Raja-Elie E. Abdulnour

TL;DR

The paper addresses the challenge of evaluating large language models on evidence-based medical knowledge by separating semantic knowledge from numeric diagnostic knowledge. It introduces EBMQA, a large-scale Q&A dataset derived from the Kahun knowledge graph, and benchmarks GPT-4 and Claude3 on 24,542 questions spanning semantic and numeric types, alongside human validation. Both LLMs perform better on semantic than on numeric questions, and Claude3 surpasses GPT-4 in numeric accuracy, yet both lag behind human experts on numeric questions, which argues for cautious integration of LLMs into clinical decision-making. The study demonstrates the utility of knowledge-graph-based QA benchmarks for assessing model reliability across medical disciplines and highlights the need for ongoing, discipline-specific benchmarking as models evolve.

Abstract

Clinical problem-solving requires processing both semantic medical knowledge, such as illness scripts, and numerical medical knowledge of diagnostic tests for evidence-based decision-making. Although large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language, evidence-based answers to clinical questions is inherently limited by tokenization. We therefore evaluated LLMs' performance on two question types, numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs across medical aspects and comparing their performance to that of humans. To generate straightforward multiple-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassing data from more than 50,000 peer-reviewed articles) and created the "EBMQA". EBMQA contains 105,000 QAs labeled with medical and non-medical topics and classified as numerical or semantic questions. We benchmarked this dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus. We evaluated the LLMs' accuracy on semantic and numerical question types and according to sub-labeled topics. For validation, six medical experts were tested on 100 numerical EBMQA questions. We found that both LLMs performed better on semantic than on numerical QAs, with Claude3 surpassing GPT4 on numerical QAs. However, both LLMs showed inter- and intra-model gaps across different medical aspects and remained inferior to humans. Thus, their medical advice should be treated with caution.
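As a minimal sketch of the per-type accuracy comparison described in the abstract, the Python snippet below groups graded answers by question type and computes accuracy for each group. The record layout, the function name `accuracy_by_type`, and the toy data are assumptions made for illustration; they are not the authors' code or results.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute accuracy per question type.

    `records` is an iterable of (question_type, is_correct) pairs.
    This layout is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in records:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    # Fraction of correctly answered questions within each type.
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Toy, made-up grading results (not data from the paper).
toy_records = [
    ("semantic", True),
    ("semantic", True),
    ("semantic", False),
    ("semantic", True),
    ("numeric", True),
    ("numeric", False),
    ("numeric", False),
]

result = accuracy_by_type(toy_records)
print(result)
```

The same grouping could be keyed on any sub-label (for example, medical discipline) instead of question type.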

Paper Structure (40 sections, 8 figures, 4 tables)

Introduction
Methods
EBMQA
Kahun
Questions Structure
Multiple Choice Questions Structure
Numerical Data and Possible Answers
QA exclusion
Labeling
Benchmark Analysis
QA selection and subanalysis
LLMs prompting
Evaluating LLM’s performance
Prompt sensitivity analysis
Human validation
...and 25 more sections

Figures (8)

Figure 1: The flowchart of the study: From Kahun's knowledge graph, which references source, target, and background as edges of the graph (1-2), to the EBMQA dataset and the LLM benchmarking (3-4), which includes both numeric and semantic QAs.
Figure 2: Validation test: Each LLM was tested eight times: four times with the option to answer “I do not know” (abstain) and four times without it, each time using the same prompt but with a different order of the possible answers. Additionally, six medical experts were tested, first with the option to abstain and then without. 95% confidence intervals were calculated accordingly, while the answer rate (AR) was reported only for runs with the abstain option.
Figure 3: Numeric QA accuracy and answer-rate sub-label analysis: (A) Answer distribution, (B) Medical Discipline, (C) Medical Subject type, (D) QA type, (E) Disorders Prevalence, (F) Question length. Red asterisks represent proportion p-values: * $.01 < p < .05$, *** $p < .0001$
Figure S1: Numeric QAs benchmark subanalysis according to medical and non-medical labels and sub-labels.
Figure S2: Distribution of the data and labels in the EBMQA: (A) Unique Medical Data Type, (B) Question Subject, (C) Medical Discipline, (D) Disorders Prevalence, (E) Question Type, (F) Question length.
...and 3 more figures
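The validation setup in the Figure 2 caption (shuffled answer order, an optional abstain choice, accuracy with 95% confidence intervals, and answer rate) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the choice to compute accuracy over answered questions only, and the normal-approximation (Wald) interval are all assumptions, since the caption does not specify them.

```python
import math
import random

ABSTAIN = "I do not know"


def shuffled_options(options, allow_abstain, rng):
    """Return the answer options in random order, optionally adding an abstain choice."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    if allow_abstain:
        shuffled.append(ABSTAIN)  # assumption: abstain option is listed last
    return shuffled


def summarize(responses, z=1.96):
    """Summarize (chosen_answer, correct_answer) pairs.

    Assumptions: answer rate = answered / total; accuracy = correct / answered;
    the 95% interval uses the Wald (normal approximation) formula.
    """
    n_total = len(responses)
    answered = [(chosen, truth) for chosen, truth in responses if chosen != ABSTAIN]
    n_answered = len(answered)
    if n_total == 0 or n_answered == 0:
        return {"answer_rate": 0.0, "accuracy": None, "ci": None}
    n_correct = sum(1 for chosen, truth in answered if chosen == truth)
    accuracy = n_correct / n_answered
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_answered)
    ci = (max(0.0, accuracy - half_width), min(1.0, accuracy + half_width))
    return {"answer_rate": n_answered / n_total, "accuracy": accuracy, "ci": ci}


# Toy, made-up responses (not data from the paper).
rng = random.Random(0)
options = shuffled_options(["A", "B", "C", "D"], allow_abstain=True, rng=rng)
toy_responses = [("A", "A"), ("B", "A"), (ABSTAIN, "C"), ("C", "C"), ("D", "D")]
summary = summarize(toy_responses)
print(options, summary)
```

With very few answered questions the Wald interval is crude; an exact or Wilson interval would be a reasonable substitute.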
