Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding

Yuqing Wang; Yun Zhao; Linda Petzold

TL;DR

The paper investigates whether current large language models are ready for healthcare by benchmarking GPT-3.5, GPT-4, and Bard across diverse clinical language tasks. It introduces self-questioning prompting (SQP) to elicit targeted medical reasoning and demonstrates its effectiveness through extensive experiments and error analysis, particularly on relation extraction. Findings show GPT-4 generally leads on information-identification tasks, Bard excels on factoid QA, and SQP provides notable improvements over standard prompting, with 5-shot learning offering additional gains. The authors argue for careful, domain-informed deployment of LLMs in healthcare, emphasizing task-specific prompting, clinician collaboration, and human verification to realize safe and impactful clinical applications.

Abstract

Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of state-of-the-art LLMs, namely GPT-3.5, GPT-4, and Bard, within the realm of clinical language understanding tasks. These tasks span a diverse range, including named entity recognition, relation extraction, natural language inference, semantic textual similarity, document classification, and question-answering. We also introduce a novel prompting strategy, self-questioning prompting (SQP), tailored to enhance LLMs' performance by eliciting informative questions and answers pertinent to the clinical scenarios at hand. Our evaluation underscores the significance of task-specific learning strategies and prompting techniques for improving LLMs' effectiveness in healthcare-related tasks. Additionally, our in-depth error analysis on the challenging relation extraction task offers valuable insights into error distribution and potential avenues for improvement using SQP. Our study sheds light on the practical implications of employing LLMs in the specialized domain of healthcare, serving as a foundation for future research and the development of potential applications in healthcare settings.
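
The abstract describes self-questioning prompting only at a high level: the model is asked to generate questions and answers about the clinical text before completing the task. A minimal sketch of how such a prompt might be assembled is shown here. The function name `build_sqp_prompt`, the step wording, and the example sentence are all illustrative assumptions, not the authors' templates (those appear in Figure 2 of the paper).

```python
def build_sqp_prompt(task_instruction: str, clinical_text: str,
                     num_questions: int = 3) -> str:
    """Assemble a self-questioning style prompt.

    Hypothetical wording: the steps mirror the idea in the abstract
    (ask questions, answer them, then do the task), not the paper's
    exact templates.
    """
    steps = [
        f"Generate {num_questions} questions about the key clinical "
        "information in the text.",
        "Answer each question using only the text.",
        f"Using those questions and answers, {task_instruction}",
    ]
    # Number the steps so the model works through them in order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Text: {clinical_text}\n\nFollow these steps:\n{numbered}"


# Illustrative usage with a made-up relation extraction input.
prompt = build_sqp_prompt(
    "classify the interaction between the two drugs.",
    "Aspirin may increase the anticoagulant effect of warfarin.",
)
print(prompt)
```

The resulting string would then be sent to whichever model is being evaluated; the model call itself is omitted because it depends on the provider.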

Paper Structure (25 sections, 4 figures, 6 tables)

Introduction
Related Work
Large Language Models in Healthcare
Prompting Strategies
Self-Questioning Prompting
Datasets
Experiments
Experimental Setup
Evaluation Procedure
Results
Overall Performance Comparison
Prompting Strategies Comparison
Task-by-Task Analysis
Case Study: Error Analysis
Discussion
...and 10 more sections

Figures (4)

Figure 1: Construction process of self-questioning prompting (SQP).
Figure 2: Self-questioning prompting (SQP) templates for six clinical language understanding tasks, with the core self-questioning process underscored and bolded. These components represent the generation of targeted questions and answers, guiding the model's reasoning and enhancing task performance.
Figure 3: Average performance comparison of three prompting methods in zero-shot and 5-shot learning settings across Bard, GPT-3.5, and GPT-4 models. Performance values are averaged across all datasets, assuming equal importance for datasets and evaluation metrics, as well as direct comparability. The self-questioning prompting method consistently outperforms standard and chain-of-thought prompting, and GPT-4 excels among the models.
Figure 4: Error correction examples using self-questioning prompting (SQP) for Bard, GPT-3.5, and GPT-4 in the SemEval 2013-DDI dataset, compared to standard prompting (StP). Each example showcases the top error for each model and how SQP addresses these challenges. As this paper primarily focuses on the effectiveness of SQP, chain-of-thought prompting is not presented in these examples.
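
The Figure 3 caption states that performance is averaged across datasets with equal importance. One plausible reading is an unweighted mean over per-dataset scores, sketched below. The function name `macro_average` and the scores are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean


def macro_average(scores_by_dataset: dict[str, float]) -> float:
    """Unweighted mean: every dataset counts equally, regardless of size."""
    return mean(list(scores_by_dataset.values()))


# Placeholder scores, NOT taken from the paper.
example_scores = {"dataset_a": 0.5, "dataset_b": 0.75, "dataset_c": 1.0}
average = macro_average(example_scores)
print(average)
```

As the caption notes, this kind of average treats different evaluation metrics as directly comparable, which is a simplifying assumption.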
