Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst

TL;DR

State-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities. The paper also offers recommendations to strengthen the conclusions of future studies.

Abstract

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
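Practices (i) and (iii) above, the head-to-head comparison with statistical uncertainty, can be illustrated with a minimal sketch. It assumes a paired percentile bootstrap over per-question exact-match indicators; the function name, the resample count, and the rule "win if the whole interval lies above zero" are illustrative assumptions, not necessarily the authors' exact procedure.

```python
import random


def paired_bootstrap_outcome(medical_correct, base_correct,
                             n_resamples=2000, alpha=0.05, seed=0):
    """Classify one (model pair, dataset) comparison as 'win', 'tie', or 'loss'.

    medical_correct / base_correct: per-question 0/1 exact-match indicators
    for the same test questions, in the same order (hypothetical inputs).
    Returns the outcome and the (lower, upper) bounds of the bootstrap
    interval for accuracy(medical) - accuracy(base).
    """
    if len(medical_correct) != len(base_correct) or not medical_correct:
        raise ValueError("inputs must be non-empty and aligned")

    rng = random.Random(seed)
    n = len(medical_correct)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement; using the same indices
        # for both models keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        med_acc = sum(medical_correct[i] for i in idx) / n
        base_acc = sum(base_correct[i] for i in idx) / n
        diffs.append(med_acc - base_acc)

    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]

    if lower > 0:
        return "win", (lower, upper)   # medical model significantly better
    if upper < 0:
        return "loss", (lower, upper)  # medical model significantly worse
    return "tie", (lower, upper)       # interval contains zero
```

Under these assumptions, the win/tie/loss rates quoted above would be the fraction of (model pair, QA dataset) combinations falling into each outcome.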

Paper Structure

This paper contains 42 sections, 7 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Medical LLMs and VLMs trained via domain-adaptive pretraining (DAPT) show limited improvement over their general-domain counterparts. (a) Overview of our head-to-head evaluation approach for each pair of general-domain (blue) and medically adapted LLM/VLM (red). (b) Win/tie/loss rate (%) of medical models vs. their corresponding base models across all (model pair, QA dataset) combinations. Win rate refers to the proportion of (model pair, QA dataset) combinations where a medical model shows a statistically significant improvement.
  • Figure 2: Overview of the prompt format sampling (left) and prompting strategy selection (right) process.
  • Figure 3: Medical LLMs do not consistently show a statistically significant improvement over their general-domain counterparts in the 3-shot setting, after independently selecting the best prompt format and examples for each model. Top row shows the absolute exact-match accuracies on the test set, and bottom row shows the relative exact-match accuracies along with 95% confidence intervals derived via bootstrapping on the test set (see Section \ref{sec:eval-setup}). Here, we show the results for greedy decoding. The 3-shot results for constrained decoding are similar (see Figure \ref{fig:llm-logprob-acc-ci}(b)).
  • Figure 4: Medical VLMs do not show a statistically significant improvement over their general-domain counterparts in the (a) zero-shot and (b) 3-shot settings, after independently selecting the best prompt format and examples for each model. Top row shows the absolute exact-match accuracies on the test set, and bottom row shows the relative exact-match accuracies along with 95% confidence intervals derived via bootstrapping on the test set (see Section \ref{sec:eval-setup}). Here, we show the results for greedy decoding. The results for constrained decoding are similar (see Figure \ref{fig:vlm-logprob-acc-ci}).
  • Figure 5: Optimizing the prompt for only the medical model and comparing models without accounting for statistical uncertainty can overestimate the performance improvements from medical DAPT. We show the win/tie/loss rate (%) of medical models vs. their base models across all (model pair, QA dataset) combinations, when (a) independently optimizing the prompt for each model and performing statistical testing, (b) optimizing the prompt only for the medical model and performing statistical testing, (c) independently optimizing the prompt for each model without statistical testing, and (d) optimizing the prompt only for the medical model without statistical testing. Here, we show the results for greedy decoding. The results for constrained decoding are similar (see Figure \ref{fig:opt-logprob-ci-acc}).
  • ...and 8 more figures
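
The contrast drawn in the Figure 5 caption, between optimizing the prompt independently for each model and optimizing it only for the medical model, can be sketched as follows. This is a toy illustration only: the function names, the use of a validation set for selection, and all accuracy numbers are made-up assumptions, not results or code from the paper.

```python
def select_prompt(val_accuracy):
    """Return the prompt id with the highest validation accuracy."""
    return max(val_accuracy, key=val_accuracy.get)


def accuracy_gap(medical_val, base_val, medical_test, base_test,
                 independent=True):
    """Test-accuracy difference (medical minus base) under one selection scheme.

    independent=True : each model is evaluated with its own best prompt.
    independent=False: the base model reuses the medical model's best prompt.
    """
    medical_prompt = select_prompt(medical_val)
    base_prompt = select_prompt(base_val) if independent else medical_prompt
    return medical_test[medical_prompt] - base_test[base_prompt]


# Hypothetical accuracies for two prompt formats, "A" and "B".
medical_val = {"A": 0.60, "B": 0.55}
base_val = {"A": 0.50, "B": 0.62}
medical_test = {"A": 0.58, "B": 0.54}
base_test = {"A": 0.49, "B": 0.60}

gap_independent = accuracy_gap(medical_val, base_val, medical_test, base_test,
                               independent=True)
gap_medical_only = accuracy_gap(medical_val, base_val, medical_test, base_test,
                                independent=False)
```

With these made-up numbers, the medical model appears better when only its prompt is optimized (positive gap) and slightly worse when each model gets its own best prompt (negative gap), which is the kind of overestimation the caption describes.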