Assessing The Potential Of Mid-Sized Language Models For Clinical QA

Elliot Bolton, Betty Xiong, Vijaytha Muralidharan, Joel Schamroth, Vivek Muralidharan, Christopher D. Manning, Roxana Daneshjou

TL;DR

This study evaluates open-source mid-sized LLMs for clinical QA by benchmarking BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B on MedQA and MultiMedQA Long Form. A standardized training and evaluation protocol, including hyperparameter sweeps and clinician reviews, enables fair comparison and robust assessment of real-world suitability. Mistral 7B achieves the best MedQA performance (63.0% after extra MedMCQA training) and generally dominates the long-form task in clinician ratings, though production readiness is constrained by errors and hallucinations. The findings underscore the potential of on-device, open-source models for clinical QA while highlighting the need for larger, more data-rich, and retrieval-augmented approaches before deployment in healthcare settings.

Abstract

Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-sized models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model to use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best-performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches that of the original Med-PaLM, and it can often produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open-source mid-sized models on clinical tasks.

Paper Structure (33 sections, 2 figures, 21 tables)