CHBench: A Chinese Dataset for Evaluating Health in Large Language Models
Chenlu Guo, Nuo Xu, Yi Chang, Yuan Wu
TL;DR
This work addresses the need for safe, trustworthy health advice from Chinese LLMs by introducing CHBench, a health-focused benchmark with 2,999 physical health and 6,493 mental health entries sourced from web posts, exams, and existing datasets. ERNIE Bot generates the gold-standard responses, and the responses of four Chinese LLMs are scored against them using cosine and Jaccard similarity, revealing notable safety and accuracy gaps across models. The dataset combines multi-source construction, annotation, and an evaluation protocol that quantifies how closely model output aligns with safety-oriented health guidance. CHBench offers a standardized resource for developing safer, more reliable Chinese health LLMs and for running reproducible, large-scale safety assessments.
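To make the two similarity metrics concrete, the following is a minimal sketch of how a model response could be scored against a gold-standard response. It assumes character-level tokenization and bag-of-characters count vectors; the summary above does not say how CHBench tokenizes or vectorizes text, so the function names and these representation choices are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def char_tokens(text):
    """Split text into characters, dropping whitespace.

    Assumption: a crude stand-in for a real Chinese tokenizer.
    """
    return [ch for ch in text if not ch.isspace()]


def jaccard_similarity(response, gold):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    set_a, set_b = set(char_tokens(response)), set(char_tokens(gold))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(response, gold):
    """Cosine similarity between token-count vectors of the two texts."""
    vec_a, vec_b = Counter(char_tokens(response)), Counter(char_tokens(gold))
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with an empty text is defined as zero here
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    gold = "请及时就医并遵医嘱服药"      # hypothetical gold-standard response
    response = "建议及时就医并按医嘱服药"  # hypothetical model response
    print("Jaccard:", jaccard_similarity(response, gold))
    print("Cosine:", cosine_similarity(response, gold))
```

Jaccard similarity only measures token overlap, while cosine similarity also reflects how often each token occurs, so the two scores can diverge on responses that repeat the same terms. An evaluation based on sentence embeddings would replace the count vectors with embedding vectors and keep the cosine formula unchanged.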
Abstract
With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. Because misinformation can have serious consequences for individuals seeking medical advice and support, the use of these models in real-world contexts necessitates a rigorous focus on safety and trustworthiness. In this work, we introduce CHBench, the first comprehensive safety-oriented Chinese health benchmark, designed to evaluate LLMs' ability to understand and address physical and mental health issues from a safety perspective across diverse scenarios. CHBench comprises 6,493 entries on mental health and 2,999 entries on physical health, spanning a wide range of topics. Our extensive evaluations of four popular Chinese LLMs highlight significant gaps in their capacity to deliver safe and accurate health information, underscoring the urgent need for further advancements in this critical domain. The code is available at https://github.com/TracyGuo2001/CHBench.
