Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

Andrea Posada; Daniel Rueckert; Felix Meissen; Philip Müller

TL;DR

The authors conduct a comprehensive survey of language models in the medical field and evaluate a subset of them on medical text classification and conditional text generation. The results underscore the potential of certain models to contain medical knowledge, even without domain specialization.

Abstract

Since the emergence of the Transformer architecture, the development of language models has accelerated, driven by their promising potential. Releasing these models into production requires a proper understanding of their behavior, particularly in sensitive domains such as medicine. Despite this need, the medical literature still lacks practical assessments of pre-trained language models, which are especially valuable in settings where only consumer-grade computational resources are available. To address this gap, we conducted a comprehensive survey of language models in the medical field and evaluated a subset of them on medical text classification and conditional text generation. The subset comprises 53 models ranging from 110 million to 13 billion parameters, spanning the Transformer-based model families and knowledge domains. Several approaches are employed for text classification, including zero-shot learning, which allows a model to be used without training it. These approaches are helpful in our target settings, where many users of language models find themselves. The results reveal remarkable performance across the tasks and datasets evaluated, underscoring the potential of certain models to contain medical knowledge, even without domain specialization. This study thus advocates for further exploration of model applications in medical contexts, particularly in computational resource-constrained settings, to benefit a wide range of users. The code is available at https://github.com/anpoc/Language-models-in-medicine.

Paper Structure (39 sections, 1 equation, 21 figures, 6 tables)

Introduction
Preliminaries
Pre-trained language models
Encoder-only models
Decoder-only models
Encoder-decoder models
Large language models
Language models in the biomedical/clinical context
Related Work
Methodology
Text classification
Datasets
Approaches
Conditional text generation task
Dataset
...and 24 more sections

Figures (21)

Figure 1: Graphical representation of the three families of Transformer-based models: encoder-only, decoder-only, and encoder-decoder models. Colors signal the correspondence between outputs and targets. Encoder-only models are mainly used for discriminative tasks. Their input is tokenized, and some of these tokens are masked. They are then fed into Transformer blocks with self-attention to obtain contextualized output embeddings, which are further processed by next sentence prediction (NSP) and language model (LM) heads or used by downstream task-specific heads. Depending on the training objective, the NSP head may or may not be necessary. Decoder-only models focus on generation tasks. Their input is tokenized and fed to Transformer blocks with causal self-attention. The causal self-attention ensures that the information flows unidirectionally from left to right. Encoder-decoder models are used for text-to-text tasks. Their encoder processes the input text, similar to encoder-only models but excluding the NSP head, and flows information to the decoder via the cross-attention mechanism. This information is used with the target output so that the decoder learns to produce the latter generatively.
Figure 2: Highest model classification scores achieved by approach for the evaluated datasets. Each point corresponds to the mean of $1\,000$ bootstrap iterations. Error bars are calculated as three times the standard deviation of the mean. The highest-performing models are consistent across datasets: BioLORD models (m07-m09) for contextual embedding similarity, MNLI fine-tuned RoBERTa and BART (m12-m13) for NLI, and the largest instruction-tuned models within the T5 family (m20-m23) and instruction-tuned models within the LLaMA family (m39-m40, m52) for multiple-choice QA. Overall, the larger instruction-tuned T5 models emerge as the top performers. The correspondence between the model and ID is found in \ref{tab:models}.
Figure 3: Impact of the logarithm of model size on model performance. Model performance is defined as the highest performance achieved by each model over the configurations evaluated. Because of either limited size diversity or the small number of samples, Spearman's rank correlation coefficient, which tests for monotonic relationships, is reported only for the multiple-choice QA approach. The p-values indicate that there is not enough evidence to establish that the correlation is statistically significant.
Figure 4: Distributions of the impact of prompting on model performance. For contextual embedding similarity and NLI, the impact of prompting is quantified as the difference in performance resulting from prompt usage, with positive values indicating improvement. As the distributions reveal, using a prompt enhances performance only in some cases. For multiple-choice QA, the impact of prompting is calculated as the variation in performance, expressed in standard deviations, across different prompts. Ideally, values are not extreme, which would indicate that performance does not depend strongly on prompt wording. In this case, the distributions reveal some highly prompt-sensitive models. The distributions are truncated at the minimum and maximum observed values to avoid misleading interpretations.
Figure 5: Mean perplexity scores for the MIMIC-CXR dataset, disaggregated by BOS token usage. Each point corresponds to the mean of $1\,000$ bootstrap iterations. Error bars are calculated as three times the standard deviation of the mean. The best performers are the LLaMA models (m38-m39), whereas the worst performers are the BioGPT models (m48-m49). Omitting the BOS token is beneficial for $77.78\%$ (14/18) of the models, the exceptions being the GPT-2 models (m26-m28) and Palmyra Base 5B (m29). The correspondence between model and ID is found in \ref{tab:models}.
...and 16 more figures
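
The captions of Figures 2 and 5 describe each point as the mean of 1,000 bootstrap iterations, with error bars of three times the standard deviation of the mean. A minimal sketch of one way to compute such a point and error bar is given below. It assumes that the bootstrapped statistic is the mean of per-sample scores and that the "standard deviation of the mean" is the standard deviation of the bootstrap means. The function name, the seed, and the toy scores are hypothetical and are not taken from the paper's repository.

```python
import random
import statistics


def bootstrap_point_and_error(per_sample_scores, n_iterations=1000, seed=0):
    """Return (point, error_bar) from a bootstrap over per-sample scores.

    Assumption: the statistic is the mean score, and the error bar is
    three times the standard deviation of the bootstrap means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(per_sample_scores)
    boot_means = []
    for _ in range(n_iterations):
        # Resample the evaluation set with replacement, keeping its size.
        resample = rng.choices(per_sample_scores, k=n)
        boot_means.append(statistics.fmean(resample))
    point = statistics.fmean(boot_means)
    error_bar = 3 * statistics.stdev(boot_means)
    return point, error_bar


if __name__ == "__main__":
    # Hypothetical per-sample correctness (1.0 = correct, 0.0 = incorrect).
    toy_scores = [1.0] * 80 + [0.0] * 20
    point, error_bar = bootstrap_point_and_error(toy_scores)
    print(f"score = {point:.3f} +/- {error_bar:.3f}")
```

Resampling the evaluation set, rather than rerunning the model, keeps the cost of the uncertainty estimate negligible, which suits consumer-grade hardware.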
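
The caption of Figure 3 reports Spearman's coefficient between the logarithm of model size and performance. The sketch below shows how such a coefficient could be computed with the standard library only, as the Pearson correlation of average ranks. The model sizes and scores are hypothetical, and the p-value mentioned in the caption is not computed here. Because the coefficient depends only on ranks, taking the logarithm of size leaves it unchanged.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical parameter counts and best scores per model.
    sizes = [110e6, 770e6, 3e9, 7e9, 13e9]
    best_scores = [0.52, 0.61, 0.58, 0.70, 0.66]
    log_sizes = [math.log(s) for s in sizes]
    print(spearman(log_sizes, best_scores))
```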
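
Figure 5 reports mean perplexity with and without a BOS token. As a hedged illustration, the sketch below uses the standard definition of perplexity, the exponential of the negative mean token log-probability, and averages it over sequences. The log-probabilities are hypothetical stand-ins for the output of a causal language model, and how the paper's code handles the BOS token is not stated in this text. One common setup, assumed here, is that prepending a BOS token lets the model also score the first real token, whereas without it the first token serves only as context.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sequence from its natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("at least one scored token is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def mean_perplexity(sequences):
    """Average the per-sequence perplexities over a dataset."""
    return sum(perplexity(seq) for seq in sequences) / len(sequences)


if __name__ == "__main__":
    # Hypothetical log-probabilities for a four-token report.
    # With BOS (assumed setup): all four tokens are scored.
    with_bos = [math.log(0.05), math.log(0.4), math.log(0.5), math.log(0.6)]
    # Without BOS (assumed setup): the first token is context only.
    without_bos = with_bos[1:]
    print("with BOS:   ", perplexity(with_bos))
    print("without BOS:", perplexity(without_bos))
```

In this toy case the first token is hard to predict, so dropping it lowers perplexity. This is only an illustration of why BOS handling can shift the metric, not an explanation of the results in Figure 5.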
