KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
Sunjun Kweon, Byungjin Choi, Gyouk Chu, Junyeong Song, Daeun Hyeon, Sujin Gan, Jueon Kim, Minkyu Kim, Rae Woong Park, Edward Choi
TL;DR
KorMedMCQA is a Korean-context, multi-profession medical licensing MCQA benchmark derived from official exams (2012–2024). The study evaluates 59 LLMs and finds that Chain-of-Thought prompting can improve accuracy by up to 4.5% over direct answering, and that MedQA is not a reliable proxy for model performance in Korea’s linguistic, regulatory, and clinical environment. It systematically analyzes data collection, splits, and subject coverage, and shows substantial cross-domain and cross-language transfer limitations when English-centric or non-Korean models are applied to Korean medical QA. It highlights the need for region-specific benchmarks to accurately gauge real-world clinical performance and presents a detailed error taxonomy for CoT outputs. The KorMedMCQA dataset and evaluation tools are publicly available on Hugging Face to accelerate Korean healthcare AI research and development.
Abstract
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from the licensing examinations for doctors, nurses, pharmacists, and dentists, covering a wide range of medical disciplines. We evaluate the performance of 59 large language models, spanning proprietary and open-source models, multilingual and Korean-specialized models, and models fine-tuned for clinical applications. Our results show that applying Chain-of-Thought (CoT) reasoning can improve model performance by up to 4.5% compared to direct answering. We also investigate whether MedQA, one of the most widely used medical benchmarks, derived from the U.S. Medical Licensing Examination, can serve as a reliable proxy for evaluating model performance in other regions, in this case Korea. Our correlation analysis between model scores on KorMedMCQA and MedQA reveals that these two benchmarks align no better than benchmarks from entirely different domains (e.g., MedQA and MMLU-Pro). This finding underscores the substantial linguistic and clinical differences between the Korean and U.S. medical contexts, reinforcing the need for region-specific medical QA benchmarks. To support ongoing research in Korean healthcare AI, we publicly release KorMedMCQA via Hugging Face.
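The correlation analysis described in the abstract compares per-model scores across two benchmarks. The following is a minimal sketch of that kind of comparison, not the authors' procedure: the abstract does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the function names, model names, and accuracy values are hypothetical placeholders rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def benchmark_correlation(scores, bench_a, bench_b):
    """Correlate scores on two benchmarks over models that have both."""
    paired = [
        (s[bench_a], s[bench_b])
        for s in scores.values()
        if bench_a in s and bench_b in s
    ]
    xs = [a for a, _ in paired]
    ys = [b for _, b in paired]
    return pearson(xs, ys)


# Hypothetical placeholder accuracies, NOT values reported in the paper.
placeholder_scores = {
    "model_a": {"kormedmcqa": 0.40, "medqa": 0.55, "mmlu_pro": 0.35},
    "model_b": {"kormedmcqa": 0.65, "medqa": 0.60, "mmlu_pro": 0.50},
    "model_c": {"kormedmcqa": 0.50, "medqa": 0.80, "mmlu_pro": 0.70},
    "model_d": {"kormedmcqa": 0.75, "medqa": 0.85, "mmlu_pro": 0.65},
}

if __name__ == "__main__":
    for a, b in [("kormedmcqa", "medqa"), ("medqa", "mmlu_pro")]:
        r = benchmark_correlation(placeholder_scores, a, b)
        print(f"{a} vs {b}: r = {r:.3f}")
```

Comparing the KorMedMCQA/MedQA coefficient against a cross-domain pair such as MedQA/MMLU-Pro mirrors the comparison the abstract describes; a rank-based coefficient could be substituted if only model orderings matter.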
