Towards Evaluating and Building Versatile Large Language Models for Medicine

Chaoyi Wu; Pengcheng Qiu; Jinxin Liu; Hongfei Gu; Na Li; Ya Zhang; Yanfeng Wang; Weidi Xie

TL;DR

This paper presents MedS-Bench, a comprehensive benchmark that evaluates large language models (LLMs) in clinical contexts across 11 high-level clinical tasks. MMedIns-Llama 3, a model instruction-tuned on the accompanying MedS-Ins dataset, significantly outperformed existing models on various clinical tasks.

Abstract

In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. Unlike existing benchmarks that focus on multiple-choice question answering, MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation, among others. We evaluated six leading LLMs, namely MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5, using few-shot prompting, and found that even the most sophisticated models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models across nearly all clinical tasks. To promote further advancements in the application of LLMs to clinical challenges, we have made the MedS-Ins dataset fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench, whose test set we plan to update regularly to track progress and enhance the adaptation of general LLMs to the medical domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. GitHub: https://github.com/MAGIC-AI4Med/MedS-Ins.

Paper Structure (17 sections, 2 equations, 3 figures, 8 tables)

Introduction
Results
The Description of MedS-Bench
The Description of MedS-Ins
Quantitative Results on Various Tasks
Discussion
Methods
Data Collection
Model Training
Baselines
Metrics
Conclusion
Supplementary
Evaluation Settings
Task Category Details
...and 2 more sections

Figures (3)

Figure 1: Benchmark Statistics. The hierarchical ring chart shows the data distribution within the evaluation benchmarks. The first tier categorizes the types of tasks, with the benchmarks encompassing 11 primary task categories. The second tier outlines the datasets involved, 28 in total. The third tier details the specific tasks, with the benchmarks collectively addressing 52 distinct tasks. Together, these tiers allow model performance to be evaluated comprehensively across multiple dimensions.
Figure 2: Overview of MedS-Ins. (a) The task collection pipeline. We assign each task a task category and a hand-written definition, resulting in a total of 19 task categories. (b) We collect 58 existing public datasets. (c) We convert the formats of the different datasets into one unified medical instruction dataset, MedS-Ins. (d) The final data distribution of the collected MedS-Ins. The Sankey diagram shows how the different text domains (left), task categories (middle), and data sources (right) contribute to the final dataset. At the bottom left, two pie charts show the data distributions over text domains and task categories, respectively.
Figure 3: The pipeline of our method. (a) The data collection pipeline. We mainly collect data by filtering natural instructions and prompting well-organized BioNLP datasets. (b) The training and evaluation pipeline for our model, which leverages the collected MedS-Ins. We use instruction tuning to train on the combined datasets and comprehensively evaluate the final model on multiple benchmarks.
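
The conversion step in Figure 2 (c) and the few-shot evaluation setting mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `InstructionSample` class, its field names, both helper functions, and the `Input:`/`Output:` prompt layout are hypothetical choices made only for illustration, and the sample record is invented.

```python
from dataclasses import dataclass


@dataclass
class InstructionSample:
    """One record in a unified instruction format (hypothetical schema)."""
    task_category: str   # e.g. one of the task categories assigned per task
    definition: str      # hand-written task definition used as the instruction
    input_text: str
    output_text: str


def to_instruction_sample(raw: dict, task_category: str, definition: str,
                          input_key: str, output_key: str) -> InstructionSample:
    """Map a dataset-specific record onto the unified schema.

    `input_key` and `output_key` name the source dataset's own fields,
    which differ from dataset to dataset.
    """
    return InstructionSample(
        task_category=task_category,
        definition=definition,
        input_text=str(raw[input_key]).strip(),
        output_text=str(raw[output_key]).strip(),
    )


def build_few_shot_prompt(definition: str, shots: list[InstructionSample],
                          query_input: str) -> str:
    """Assemble a few-shot prompt: definition, worked examples, then the query."""
    parts = [definition.strip()]
    for shot in shots:
        parts.append(f"Input: {shot.input_text}\nOutput: {shot.output_text}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query_input}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Invented record standing in for one sample of a summarization dataset.
    raw = {"report": " Chest X-ray shows no acute findings. ",
           "summary": "Normal chest X-ray."}
    sample = to_instruction_sample(
        raw, "Text Summarization", "Summarize the clinical report.",
        input_key="report", output_key="summary")
    print(build_few_shot_prompt(sample.definition, [sample],
                                "CT head without contrast is unremarkable."))
```

Keeping the task definition separate from the input lets the same record serve both as an instruction tuning example and as a demonstration in a few-shot prompt.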
