MedEthicEval: Evaluating Large Language Models Based on Chinese Medical Ethics
Haoan Jin, Jiacheng Shi, Hanhui Xu, Kenny Q. Zhu, Mengyue Wu
TL;DR
MedEthicEval introduces a two-component benchmark (knowledge and application) comprising four datasets for systematically evaluating LLMs on Chinese medical ethics, including three novel application datasets covering violation detection, priority dilemmas, and equilibrium dilemmas. It draws on public ethics knowledge sources and expert-informed taxonomies, uses Qwen2.5 to generate data, and applies multiple adversarial prompts to stress-test ethical reasoning. Across six models, Qwen2.5 generally shows the strongest ethics knowledge and application performance, LLaMa3-8B demonstrates notable ethical reasoning despite its smaller size, and fine-tuning alone (as with HA) provides limited gains. The work highlights vulnerabilities to prompts such as post-hoc justification and outlines practical implications for deploying AI in healthcare, while acknowledging three limitations: cultural variability, evolving ethical challenges, and dataset size.
Abstract
Large language models (LLMs) demonstrate significant potential in advancing medical applications, yet their capabilities in addressing medical ethics challenges remain underexplored. This paper introduces MedEthicEval, a novel benchmark designed to systematically evaluate LLMs in the domain of medical ethics. Our framework encompasses two key components: knowledge, assessing the models' grasp of medical ethics principles, and application, focusing on their ability to apply these principles across diverse scenarios. To support this benchmark, we consulted with medical ethics researchers and developed three datasets addressing distinct ethical challenges: blatant violations of medical ethics, priority dilemmas with clear inclinations, and equilibrium dilemmas without obvious resolutions. MedEthicEval serves as a critical tool for understanding LLMs' ethical reasoning in healthcare, paving the way for their responsible and effective use in medical contexts.
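The benchmark layout described above (a knowledge component plus an application component with three datasets) can be sketched as a small data structure. This is only an illustrative outline: the identifiers `Component`, `Dataset`, `BENCHMARK`, and `datasets_for`, as well as the dataset names, are hypothetical and are not taken from the paper or any released code.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The two benchmark components named in the abstract."""
    KNOWLEDGE = "knowledge"
    APPLICATION = "application"


@dataclass(frozen=True)
class Dataset:
    name: str             # hypothetical identifier, not the paper's own label
    component: Component  # which component the dataset belongs to
    focus: str            # what the dataset probes, paraphrased from the abstract


# Assumed layout: one knowledge dataset and three application datasets.
BENCHMARK = [
    Dataset("knowledge", Component.KNOWLEDGE,
            "grasp of medical ethics principles"),
    Dataset("violation_detection", Component.APPLICATION,
            "blatant violations of medical ethics"),
    Dataset("priority_dilemma", Component.APPLICATION,
            "priority dilemmas with clear inclinations"),
    Dataset("equilibrium_dilemma", Component.APPLICATION,
            "equilibrium dilemmas without obvious resolutions"),
]


def datasets_for(component: Component) -> list[str]:
    """Return the names of the datasets that belong to one component."""
    return [d.name for d in BENCHMARK if d.component is component]
```

Calling `datasets_for(Component.APPLICATION)` lists the three application datasets in the order given above. Keeping the component as an explicit field is one possible design choice that would let per-component scores be aggregated separately.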
