Does Biomedical Training Lead to Better Medical Performance?

Amin Dada; Marie Bauer; Amanda Butler Contreras; Osman Alperen Koraş; Constantin Marc Seibold; Kaleb E Smith; Jens Kleesiek

TL;DR

This study examines whether biomedical fine-tuning improves medical performance by evaluating 25 models on six real-world clinical tasks from the CLUE benchmark. Most biomedically adapted models underperform their general-domain base models, especially on longer, more complex tasks, where hallucinations and coding errors become more frequent. Only a weight-merging approach (BioMistral-DARE) shows consistent gains, while newer, larger general-domain models often perform best. The findings challenge the assumption that domain-specific training is superior for practical healthcare tasks, and the authors release an open-source evaluation framework to encourage more robust, real-world benchmarking.

Abstract

Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training by evaluating 25 models on six practical medical tasks. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD-10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.

Paper Structure (16 sections, 8 figures, 5 tables)

Introduction
Related Work
Evaluation Tasks
Experimental setup
Models
Results
Error Analysis
Discussion
Conclusion
Task Details
Metrics
Experimental setup
Computational Resources
Models
Prompting
...and 1 more section

Figures (8)

Figure 1: Comparison of average scores between general-domain models and highest scoring biomedical models.
Figure 2: MeQSum prompt format with example.
Figure 3: Problem Summary prompt format with example.
Figure 4: MedNLI prompt format with example.
Figure 5: LongHealth prompt format.
...and 3 more figures
