M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering

Anand Subramanian; Viktor Schlegel; Abhinav Ramesh Kashyap; Thanh-Tung Nguyen; Vijay Prakash Dwivedi; Stefan Winkler

TL;DR

The paper introduces M-QALM, a large, open-source benchmark of 22 clinical QA datasets designed to assess medical knowledge recall and reading comprehension in LLMs. It evaluates 15 open-source LLMs in zero-shot and fine-tuned regimes on multiple-choice (MCQA) and abstractive (AQA) question answering, supplemented by a manual error analysis and generalization checks. Key findings show that instruction tuning and domain-focused fine-tuning can improve performance and generalize to some unseen data, but open-domain medical LLMs still trail human experts and proprietary systems, and current AQA metrics exhibit reliability concerns. By releasing the dataset, methodology, and evaluation protocol, the work provides a standardized framework to advance clinical knowledge representation learning in LLMs while highlighting limitations and avenues for future research.

Abstract

There is active research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, little is understood about the extent to which LLMs can recall relevant knowledge and combine it with presented information in the clinical and biomedical domain, or about the factors that contribute to this ability: a fundamental prerequisite for success on downstream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field, we share M-QALM, our resources, standardised methodology, and evaluation results with the research community to facilitate further advancements in clinical knowledge representation learning within language models.

Paper Structure (16 sections, 41 figures, 19 tables)

Related Work
M-QALM Datasets
Empirical Evaluation
Study Setup
Results and Analysis
Zero-shot Evaluation Results
Impact of Fine-tuning
Error Analysis
Category-wise and Manual Error Analysis
Error Analysis of LLaMA 2
Conclusions
Datasets Used
Performance of other methods for MCQA datasets
Correlation between AQA and MCQA metrics
Analysis of the causes of generalisation to unseen datasets
...and 1 more sections

Figures (41)

Figure 1: Performance of base and AQA-fine-tuned LLaMA 2 and Flan-T5 models on unseen AQA test sets.
Figure 2: Performance of base, MCQA-tuned, and AQA-tuned LLaMA 2 model on unseen MCQA test sets.
Figure 3: Sample questions corresponding to each category of the manual error analysis.
Figure 4: Zero-shot performance of models on MCQA (top-left) and AQA (top-right, bottom-left and bottom-right) as a function of model size. The dashed line represents a fitted linear regression showing the correlation between the model size and the score.
Figure 5: Performance of base and AQA-fine-tuned LLaMA 2 and Flan-T5 models on four unseen AQA test sets in terms of ROUGE-L.
...and 36 more figures
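
Figure 5 reports AQA performance in terms of ROUGE-L. For readers unfamiliar with the metric, the following is a minimal sketch of sentence-level ROUGE-L computed as an F1 score over the longest common subsequence (LCS) of tokens. Lowercased whitespace tokenization and the balanced F1 weighting are assumptions made here for illustration; the paper's exact ROUGE configuration is not stated in this summary, and the function names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence ending before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping either token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 (assumed: lowercased whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration only, not taken from any dataset.
    print(rouge_l_f1("a b c d", "a c"))
```

Because the score rewards only in-order token overlap, a correct answer phrased differently from the reference can score low, which is one way to read the reliability concerns about AQA metrics noted in the TL;DR.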
