MedConceptsQA: Open Source Medical Concepts QA Benchmark

Ofir Ben Shoham, Nadav Rappoport

TL;DR

MedConceptsQA is an open-source benchmark of over 800k questions on medical concepts encoded by ICD9-CM, ICD10-CM, ICD9-PROC, ICD10-PROC, and ATC drug codes, organized into three difficulty levels to assess LLM understanding and reasoning. The study finds that clinical LLMs pre-trained on medical data perform close to random guessing on this benchmark, while GPT-4 outperforms them by roughly 27 percentage points in the zero-shot setting and 37 in the few-shot setting, though it still falls short of perfect accuracy. With its extensive, diverse evaluation suite and accompanying reproducibility resources, the work highlights persistent gaps in medical concept understanding and offers a platform for benchmarking future medical-domain LLMs and instruction-tuning approaches, with potential use in model training, evaluation, and health informatics research.

Abstract

We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA
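
The abstract compares models by accuracy relative to random guessing and reports gains as absolute improvements, that is, differences in percentage points rather than relative changes. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function names, the toy answer lists, and the choice of four answer options per question are assumptions made for illustration; this section does not state the number of options.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted option equals the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return 1.0 / num_options


def absolute_gain_pp(acc_a: float, acc_b: float) -> float:
    """Absolute accuracy difference between two models, in percentage points."""
    return (acc_a - acc_b) * 100.0


# Toy data, invented for illustration only (not taken from the benchmark).
gold = ["A", "C", "B", "D", "A", "B", "C", "D"]
model_x = ["A", "C", "B", "D", "A", "B", "A", "A"]  # hypothetical stronger model
model_y = ["A", "B", "C", "D", "B", "A", "D", "C"]  # hypothetical near-random model

acc_x = accuracy(model_x, gold)
acc_y = accuracy(model_y, gold)
baseline = random_baseline(4)  # assumption: four options per question

print(f"model_x accuracy: {acc_x:.2f}")
print(f"model_y accuracy: {acc_y:.2f} (random baseline: {baseline:.2f})")
print(f"absolute gain: {absolute_gain_pp(acc_x, acc_y):.1f} percentage points")
```

Reporting the gap in percentage points keeps it comparable across models whose baseline accuracies differ, which a relative improvement would not.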

Paper Structure

This paper contains 14 sections, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Questions across three difficulty levels for the S46.211D medical code within the ICD10-CM vocabulary.
  • Figure 2: Examples of zero-shot and few-shot learning prompts.
  • Figure 3: Aggregated results for zero-shot and few-shot evaluations. The vertical line represents the accuracy of random guessing for comparison.
  • Figure 4: Zero-shot and few-shot results for each of the levels (easy, medium, and hard) with 95% confidence intervals over three runs. Results are aggregated over vocabularies.
  • Figure 5: Zero-shot and few-shot results for each of the vocabularies with 95% confidence intervals over three runs. Results are aggregated over difficulty levels.
  • ...and 1 more figure