MedEthicEval: Evaluating Large Language Models Based on Chinese Medical Ethics

Haoan Jin, Jiacheng Shi, Hanhui Xu, Kenny Q. Zhu, Mengyue Wu

TL;DR

MedEthicEval introduces a two-component benchmark (knowledge and application) with four datasets to systematically evaluate LLMs on Chinese medical ethics, including three novel application datasets for violation detection, priority dilemmas, and equilibrium dilemmas. It draws on public ethics knowledge sources and expert-informed taxonomies, and uses Qwen2.5-driven data generation together with multiple adversarial prompts to stress-test ethical reasoning. Across six models, Qwen2.5 generally shows the strongest ethics knowledge and application, LLaMA3-8B demonstrates notable ethical reasoning despite its smaller size, and fine-tuning alone (as with HA) provides limited gains. The work highlights vulnerabilities to prompts such as post-hoc justification and outlines practical implications for deploying AI in healthcare, while acknowledging cultural variability, evolving ethical challenges, and dataset size limitations.

Abstract

Large language models (LLMs) demonstrate significant potential in advancing medical applications, yet their capabilities in addressing medical ethics challenges remain underexplored. This paper introduces MedEthicEval, a novel benchmark designed to systematically evaluate LLMs in the domain of medical ethics. Our framework encompasses two key components: knowledge, assessing the models' grasp of medical ethics principles, and application, focusing on their ability to apply these principles across diverse scenarios. To support this benchmark, we consulted with medical ethics researchers and developed three datasets addressing distinct ethical challenges: blatant violations of medical ethics, priority dilemmas with clear inclinations, and equilibrium dilemmas without obvious resolutions. MedEthicEval serves as a critical tool for understanding LLMs' ethical reasoning in healthcare, paving the way for their responsible and effective use in medical contexts.

Paper Structure

This paper contains 29 sections, 15 figures, and 9 tables.

Figures (15)

  • Figure 1: Overview of MedEthicEval.
  • Figure 2: A branch of the medical scenarios taxonomy. The full taxonomy can be found at the URL in the footnote.
  • Figure 3: A sample from the Detecting Violation subset of MedEthicEval.
  • Figure 4: Three subsets of the application evaluation. The blue objects on the scales represent specific medical ethics principles, and the tilt of the scales indicates the prioritization of one principle over another.
  • Figure 5: Comparison of GPT-4 and Qwen2.5 in generating violation scenarios for medical ethics. Qwen2.5 generates queries with subtler violations of medical ethics, whereas GPT-4 produces more overt ethical breaches.
  • ...and 10 more figures