47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations

Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song, Xinyuan Song, Ziqian Bi

TL;DR

This study benchmarks 27 LLMs on a newly constructed 2,800-question Chinese medical licensing exam dataset spanning seven specialties and two professional levels. Mixtral-8x7B achieves the top overall accuracy (74.25%), mixture-of-experts architectures consistently outperform dense models by about 18%, and model size correlates only weakly with performance. The results show strong variation across specialties and only small performance differences between the two professional levels, pointing to architectural design and domain adaptation as key drivers of medical QA performance. The findings inform the deployment of LLMs in medical education and clinical decision support, and underscore the need for careful validation, safety considerations, and further research on modular architectures and domain-aware evaluation frameworks.

Abstract

The rapid advancement of large language models (LLMs) has prompted significant interest in their potential applications in medical domains. This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art LLMs on Chinese medical examination questions, encompassing seven medical specialties across two professional levels. We introduce a robust evaluation framework that assesses model performance on 2,800 carefully curated questions from the cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine domains. Our dataset distinguishes between attending physician and senior physician difficulty levels, providing nuanced insights into model capabilities across varying complexity. Our empirical analysis reveals substantial performance variation among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%, followed by DeepSeek-R1-671B at 64.07%. Notably, we observe no consistent correlation between model size and performance, as evidenced by the strong performance of smaller mixture-of-experts architectures. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions than on gastroenterology and nephrology questions. Furthermore, our analysis indicates minimal performance degradation between the attending and senior physician levels for top-performing models, suggesting robust generalization capabilities. This benchmark provides critical insights for the deployment of LLMs in medical education and clinical decision support systems, highlighting both the promise and the current limitations of these technologies in specialized medical contexts.
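To make the headline numbers concrete, the following is a minimal sketch of how an accuracy and a 95% confidence interval could be computed for one model on a 2,800-question benchmark. The paper does not state how its confidence intervals were obtained, so the Wilson score interval used here is an assumption, as is the helper name `accuracy_with_ci`. The count of 2,079 correct answers is back-computed from the reported 74.25% of 2,800 and is not a figure taken from the paper.

```python
import math


def accuracy_with_ci(correct: int, total: int, z: float = 1.96):
    """Return (accuracy, lower, upper) using the Wilson score interval.

    Hypothetical helper for illustration; z = 1.96 corresponds to a
    two-sided 95% interval under the normal approximation.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, center - half, center + half


# 2,079 correct out of 2,800 reproduces the reported 74.25% (assumed count).
acc, lo, hi = accuracy_with_ci(2079, 2800)
print(f"accuracy {acc:.2%}, 95% CI {lo:.2%} to {hi:.2%}")
```

The same function can be applied to any subset of questions, such as a single specialty or professional level. With fewer questions the interval widens, which matters when comparing models within one specialty.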

Paper Structure

This paper contains 23 sections, 7 figures.

Figures (7)

  • Figure 1: Radar chart comparing top 5 models across medical specialties with larger areas indicating better overall performance.
  • Figure 2: Overall model performance across all medical specialties and professional levels, ranked by average accuracy with 95% confidence intervals.
  • Figure 3: Performance heatmap showing accuracy percentages for top 15 models across all medical specialties and professional levels.
  • Figure 4: Comparison of model performance between attending and senior physician examination levels for top 10 models.
  • Figure 5: Relationship between model size (log scale) and performance, with models achieving >50% accuracy highlighted.
  • ...and 2 more figures