MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

Iñigo Alonso; Maite Oronoz; Rodrigo Agerri

TL;DR

Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs, with a best accuracy of around 75 for English, still has large room for improvement, especially for languages other than English, for which accuracy drops by 10 points.

Abstract

Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts with interactive decision support, as demonstrated by their competitive performance in Medical QA. However, while impressive, the quality bar required for medical applications remains far from being achieved. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks that assess medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLM predictions. Finally, the situation is particularly grim if we consider benchmarking LLMs for languages other than English, which remains, as far as we know, a totally neglected topic. In order to address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by medical doctors, which can be leveraged to establish various gold-based upper bounds for comparison with LLM performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs still has large room for improvement, especially for languages other than English. Furthermore, and despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work may encourage its extension to other languages.

Paper Structure (22 sections, 10 figures, 13 tables)

Introduction
Related Work
Materials and Methods
Models
Retrieval-Augmented Generation (RAG)
MedExpQA: A new multilingual benchmark for Medical QA
Antidote CasiMedicos Dataset
The MedExpQA Benchmark
Full Reference Gold Explanations
Explanation of the Incorrect Options
Full Gold Explanation with Explicit References Hidden
Experimental Setup
Evaluation
Results
Zero-shot results
...and 7 more sections

Figures (10)

Figure 1: Graphical description of the MedExpQA benchmark in which various types of gold and external medical knowledge are added to Large Language Models in order to find the correct answer in the CasiMedicos dataset.
Figure 2: Overview of averaged results in MedExpQA for gold knowledge grounding and for automatic knowledge grounding based on Retrieval Augmented Generation (RAG). E: gold explanations written by medical doctors; H: E with explicit references to the possible answers hidden; EI: gold explanations about the incorrect options; RAG-32: automatically retrieved knowledge grounding (details in the Experimental Setup section); no-grounding: baseline model with no external knowledge.
Figure 3: Distribution of correct answers in the train, validation and test splits. The percentage in blue indicates the proportion of exams with the first option, number 1, as the correct answer; orange corresponds to option 2; yellow to option 3; green to option 4; and brown to option 5. Note that not every document includes 5 possible options.
Figure 4: Distribution of retrieved documents across different context windows. Three different histograms are shown that depict the maximum number of documents that can be accommodated within various context windows across dataset examples: 2,048 tokens (PMC-LLaMA), 4,096 tokens (LLaMA2), and 8,192 tokens (Mistral and BioMistral).
Figure 5: Performance of different models in a zero-shot setting with up to 0, 2, 4, 8, 16, and 32 retrieved snippets.
...and 5 more figures
