MedConceptsQA: Open Source Medical Concepts QA Benchmark
Ofir Ben Shoham, Nadav Rappoport
TL;DR
MedConceptsQA is an open-source benchmark of over 800k questions covering medical concepts encoded by ICD9-CM, ICD10-CM, ICD9-PROC, ICD10-PROC, and ATC drug codes, organized into three difficulty levels to assess LLM understanding and reasoning. The study finds that clinical LLMs pre-trained on medical data perform close to random guessing on this benchmark, while GPT-4 outperforms them by roughly 27 percentage points in the zero-shot setting and 37 percentage points in the few-shot setting, though it still falls short of perfect accuracy. By providing an extensive, diverse evaluation suite and accompanying reproducibility resources, the work highlights persistent gaps in medical concept understanding and offers a platform for benchmarking future medical-domain LLMs and instruction-tuning approaches. The benchmark can therefore help evaluate and guide the development of more reliable medical reasoning in LLMs, with potential use in model training, evaluation, and health informatics research.
Abstract
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluated various Large Language Models on the benchmark. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. In contrast, GPT-4 achieves an absolute average improvement of roughly 27 to 37 percentage points (27 for zero-shot learning and 37 for few-shot learning) compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA
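To make the phrases "close to random guessing" and "absolute improvement" concrete, the following minimal sketch shows how such quantities are typically computed for a multiple-choice benchmark. It is an illustration only, not the authors' evaluation code: the function names, the number of answer options (`num_options=4`), and the toy predictions are all assumptions, and the printed numbers are not results from the benchmark.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_options


def absolute_improvement(acc_a: float, acc_b: float) -> float:
    """Difference in percentage points between two accuracies."""
    return acc_a - acc_b


# Toy data (hypothetical, NOT benchmark results): gold answers and two models.
gold = ["A", "B", "C", "D", "A", "B", "C", "D"]
model_x = ["A", "B", "C", "D", "A", "B", "A", "A"]  # 6 of 8 correct
model_y = ["A", "A", "A", "A", "A", "A", "A", "A"]  # 2 of 8 correct

acc_x = accuracy(model_x, gold)
acc_y = accuracy(model_y, gold)
baseline = random_baseline(num_options=4)  # assumed number of answer options

print(f"model_x accuracy: {acc_x:.1f}%")
print(f"model_y accuracy: {acc_y:.1f}% (random baseline: {baseline:.1f}%)")
print(f"absolute improvement: {absolute_improvement(acc_x, acc_y):.1f} points")
```

An absolute improvement is a difference between two accuracies in percentage points, which is distinct from a relative improvement expressed as a ratio of the two.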
