Large Language Models in the Clinic: A Comprehensive Benchmark

Fenglin Liu; Zheng Li; Hongjian Zhou; Qingyu Yin; Jingfeng Yang; Xianfeng Tang; Chen Luo; Ming Zeng; Haoming Jiang; Yifan Gao; Priyanka Nigam; Sreyashi Nag; Bing Yin; Yining Hua; Xuan Zhou; Omid Rohanian; Anshul Thakur; Lei Clifton; David A. Clifton

TL;DR

This work constructs six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis, and invites medical experts to evaluate the clinical usefulness of LLMs.

Abstract

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly evaluate LLMs on close-ended question-answering (QA) tasks with answer options. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

Paper Structure (24 sections, 4 figures, 8 tables)

Introduction
Key Findings
ClinicBench
Discussion
Results
Settings
Automatic Evaluation
Clinical Task Analysis
Few-shot Analysis
Human Evaluation
Effect of Instruction Fine-tuning Data
Qualitative Analysis
Conclusions
Machine Learning Tasks
Question Answering
...and 9 more sections

Figures (4)

Figure 1: Overview of our ClinicBench, which includes 22 LLMs, 11 tasks, 17 datasets, and multiple metrics across automatic and human evaluations.
Figure 2: Comparison of LLMs' performance on machine learning and clinical tasks. When applied to clinical tasks, the performance drops of the LLMs are shown with the solid line and the right y-axis. Lower is better.
Figure 3: Performance of representative LLMs under the few-shot (1,3,5-shot) learning settings.
Figure 4: We present an example of patient education generated by different models to analyze the impact of instruction fine-tuning data.
