MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Tessa Han; Aounon Kumar; Chirag Agarwal; Himabindu Lakkaraju

TL;DR

This paper defines a domain-specific notion of medical safety for LLMs based on the American Medical Association (AMA) Principles of Medical Ethics and introduces MedSafetyBench, a first-of-its-kind benchmark consisting of 1,800 harmful medical requests with safe responses. It demonstrates that publicly available medical LLMs exhibit safety gaps and that fine-tuning with MedSafetyBench can significantly improve medical safety while preserving core medical capabilities. The work provides a systematic framework for evaluating and aligning LLMs to medical ethics, accompanied by a dataset, code, and validation by domain experts. This benchmark enables targeted, scalable improvements in the safe deployment of LLMs in clinical contexts and invites future work on nuanced safety standards across subspecialties.

Abstract

As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights. However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association. We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset designed to measure the medical safety of LLMs. We demonstrate the utility of MedSafetyBench by using it to evaluate and improve the medical safety of LLMs. Our results show that publicly available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety while preserving their medical performance. By introducing this new benchmark dataset, our work enables a systematic study of the state of medical safety in LLMs and motivates future work in this area, paving the way to mitigate the safety risks of LLMs in medicine. The benchmark dataset and code are available at https://github.com/AI4LIFE-GROUP/med-safety-bench.

Paper Structure (21 sections, 16 figures, 4 tables)

Introduction
Related Work
MedSafetyBench: A Benchmark Dataset for the Medical Safety of LLMs
Defining Medical Safety for LLMs
Developing the Benchmark Dataset
Experiments
Evaluating the Medical Safety of LLMs
Improving the Medical Safety of LLMs
Discussion and Conclusion
MedSafetyBench
Developing harmful medical requests
Developing safe responses to the harmful medical requests
Evaluating the medical safety of LLMs
Evaluation details
Additional results
...and 6 more sections

Figures (16)

Figure 1: Contribution and findings. In this work, we define the notion of medical safety for LLMs, leverage this definition to develop a medical safety benchmark dataset, and use this benchmark to evaluate and improve the medical safety of LLMs. We find that 1) publicly available medical LLMs do not meet standards of medical safety and that 2) fine-tuning these LLMs on medical safety demonstrations significantly improves their safety while preserving their medical performance.
Figure 2: Average harmfulness score for each LLM by harm dataset. On the x-axis, LLMs with safety alignment are indicated by an asterisk. Error bars indicate the standard error of the mean. The results indicate that medical LLMs readily comply with harmful general and medical requests, and they do so more frequently than their safety-aligned, general-knowledge counterparts. Thus, medical LLMs do not meet standards of general and medical safety.
Figure 3: Safety of medical LLMs before fine-tuning (red) and after fine-tuning (green) on safety demonstrations. Error bars indicate the standard error of the mean. Fine-tuning on safety demonstrations significantly improves the safety of original medical LLMs. This trend is consistent across medical LLMs (Medalpaca-7b, Medalpaca-13b, and ClinicalCamel-70b), across evaluation datasets (GenSafety-Eval, MedSafety-Eval-GPT4, MedSafety-Eval-Llama2), and across the types of safety demonstrations on which the model is fine-tuned (general, medical, or both).
Figure 4: Harmfulness score distributions for each LLM by harm dataset. LLMs that have been aligned to generate safe responses are indicated by an asterisk. The results indicate that medical LLMs readily comply with harmful general and medical requests, and they do so more frequently than their safety-aligned, general-knowledge counterparts. Thus, medical LLMs do not meet currently achievable standards of general and medical safety.
Figure 5: Harmfulness score raw distributions for each LLM by harm dataset. LLMs that have been aligned to generate safe responses are indicated by an asterisk. The results indicate that for medical LLMs, many responses to general and medical harmful requests fully comply with the requests.
...and 11 more figures
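
The captions for Figures 2 and 3 report an average harmfulness score per model, with error bars showing the standard error of the mean. A minimal sketch of that summary statistic follows. The function name `mean_and_sem` and the example scores are hypothetical, chosen only for illustration; they are not taken from the paper's code or results, and the paper's own scoring procedure may differ.

```python
import math
from statistics import mean, stdev


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for a list of scores.

    The SEM here is the sample standard deviation (n - 1 denominator)
    divided by sqrt(n). This is an assumed convention, not one confirmed
    by the source.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the SEM")
    return mean(scores), stdev(scores) / math.sqrt(n)


# Hypothetical harmfulness scores for one model on one harm dataset.
example_scores = [1, 2, 3, 4, 5]
avg, sem = mean_and_sem(example_scores)
print(f"mean = {avg:.2f}, SEM = {sem:.2f}")
```

A bar at height `avg` with an error bar of plus or minus `sem` would then correspond to one model and one harm dataset in a plot of this kind.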
