TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine

Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li, Bing Han, Wenchao Tang, Jie Xu, Ming Li

TL;DR

This work introduces TCM-5CEval, a five-dimension benchmark for evaluating LLMs in Traditional Chinese Medicine. It expands the prior 3C framework to cover Core Knowledge, Classical Literacy, Clinical Decision-Making, Chinese Materia Medica, and Clinical Non-pharmacological Therapy. It constructs five textbook-based sub-datasets and applies a dual-path evaluation (objective accuracy and open-ended scoring) plus a permutation-based consistency test to measure robustness. Results show strong performance on foundational knowledge but weaker interpretive and reasoning capabilities, with notable sensitivity to option ordering across models. The benchmark is hosted on the public MedBench platform for standardized, multi-dimension comparison, and the study outlines future directions toward real-world data, disambiguation mechanisms, and multi-modal evaluation.

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the MedBench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.

Paper Structure

This paper contains 17 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview diagram of TCM 5C-EVAL
  • Figure 2: TCM-LLM Multi-Metric Assessment Workflow
  • Figure 3: Model Performance on the Single-Choice Permutation Consistency Test. The figure compares the standard accuracy on the original questions ('Single Question') with the consistency-based accuracy ('ID Group') for each model across the five sub-datasets. The 'ID Group' score is awarded only when a model correctly answers a question across all five cyclical permutations of its options, thus measuring its robustness against option-order bias.
  • Figure 4: Performance Distribution of Leading Models by Sub-Dimension. ★: The model that performs best in this sub-dimension
  • Figure 5: High-Frequency Errors in Question Sets (A. Single-Choice; B. Multiple-Choice; C. Open-Ended)
  • ...and 3 more figures
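
The permutation consistency test described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes each single-choice question has five options, so that cyclic rotation yields the five orderings the caption mentions, and it uses a hypothetical `answer_fn(stem, options)` callable as a stand-in for a model query that returns the index of the chosen option. The names `cyclic_permutations` and `consistency_scores` are also illustrative.

```python
from typing import Callable, List, Sequence, Tuple

# (stem, options in original order, text of the correct option)
Question = Tuple[str, Sequence[str], str]


def cyclic_permutations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic rotation of the options; rotation 0 is the original order."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]


def consistency_scores(
    questions: Sequence[Question],
    answer_fn: Callable[[str, Sequence[str]], int],  # hypothetical model stand-in
) -> Tuple[float, float]:
    """Return (single-question accuracy, ID-group accuracy).

    Single-question accuracy counts a hit when the original ordering is
    answered correctly. ID-group accuracy counts a hit only when every
    cyclic rotation of the options is answered correctly.
    """
    single_hits = 0
    group_hits = 0
    for stem, options, correct in questions:
        results = []
        for rotated in cyclic_permutations(options):
            chosen = answer_fn(stem, rotated)
            results.append(rotated[chosen] == correct)
        single_hits += int(results[0])
        group_hits += int(all(results))
    total = len(questions)
    return single_hits / total, group_hits / total


if __name__ == "__main__":
    # Toy data, purely illustrative; not drawn from the benchmark.
    demo = [
        ("q1", ["a", "b", "c", "d", "e"], "a"),
        ("q2", ["a", "b", "c", "d", "e"], "c"),
    ]
    key = {stem: correct for stem, _, correct in demo}

    def always_first(stem, options):
        # A fully position-biased "model" that always picks the first slot.
        return 0

    def oracle(stem, options):
        # A "model" that tracks option content rather than position.
        return list(options).index(key[stem])

    print(consistency_scores(demo, always_first))
    print(consistency_scores(demo, oracle))
```

In this toy setup, a position-biased answerer can still score on the original ordering while scoring zero on the grouped metric. That gap between 'Single Question' and 'ID Group' is what the figure reports per model.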