CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

TL;DR

CareMedEval targets grounded critical appraisal of biomedical literature by benchmarking LLMs on authentic French medical exam questions derived from 37 scientific articles. The dataset comprises 534 multiple-choice questions (MCQs) answered in a multi-label setting, and predictions are scored with Exact Match Ratio (EMR), F1, Hamming, and a custom LCA score that reflects exam-style grading. Results show that even strong models struggle with questions about study limitations and statistical analysis, although providing the full article text as context and generating explicit reasoning traces both improve performance. The study highlights current LLM limitations in critical appraisal and suggests directions for future development, including reasoning-aware prompts and vision-enabled extensions for figures in biomedical papers.

Abstract

Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
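To make the multi-label metrics concrete, the following is a minimal Python sketch of how Exact Match Ratio, a Hamming-style score, and a per-question F1 could be computed for multiple-choice questions with several correct answers. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the five-option layout (A to E), the per-option reading of the Hamming metric, and the toy answers are all hypothetical. The custom LCA score is not sketched because its definition is not given here.

```python
from typing import List, Sequence, Set

# Assumption: each question offers five options labelled A-E.
OPTIONS = ("A", "B", "C", "D", "E")


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of questions whose predicted answer set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hamming_score(gold: List[Set[str]], pred: List[Set[str]],
                  options: Sequence[str] = OPTIONS) -> float:
    """Fraction of per-option decisions (selected or not) that match gold.

    Assumption: "Hamming" is read here as per-option accuracy, i.e.
    1 minus the Hamming loss; the paper may define it differently.
    """
    agree = 0
    for g, p in zip(gold, pred):
        agree += sum((o in g) == (o in p) for o in options)
    return agree / (len(gold) * len(options))


def sample_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """F1 computed per question over selected options, then averaged."""
    total = 0.0
    for g, p in zip(gold, pred):
        if not g and not p:
            total += 1.0  # both empty: treat as a perfect match
            continue
        true_pos = len(g & p)
        total += 2 * true_pos / (len(g) + len(p))
    return total / len(gold)


# Toy data (invented for illustration, not taken from the dataset).
gold = [{"A", "C"}, {"B"}, {"A", "D", "E"}]
pred = [{"A", "C"}, {"B", "C"}, {"A", "D"}]

print(f"EMR:     {exact_match_ratio(gold, pred):.3f}")
print(f"Hamming: {hamming_score(gold, pred):.3f}")
print(f"F1:      {sample_f1(gold, pred):.3f}")
```

In this toy example only the first question is answered exactly, so the Exact Match Ratio is 1/3 even though most individual option decisions are correct. This shows why exact match is the strictest of the three metrics and why scores below 0.5 can coexist with much higher F1 or Hamming values.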

Paper Structure

This paper contains 24 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example from the dataset showing an excerpt from a scientific article, the given instruction prompt with a corresponding question, and answer choices.
  • Figure 2: Overview of the model evaluation pipeline of the CareMedEval benchmark. The input consists of a zero-shot instruction prompt containing a question and its possible answer choices, along with the article (plain text only in our experimental setting). The model generates predicted answers, which are then evaluated using a set of quantitative metrics to assess performance.
  • Figure 3: Exact Match Ratio comparison across different evaluation scenarios, illustrating model performance when provided with the full article, only the abstract, or no context (only the question and answer options with the instruction prompt).
  • Figure 4: Heatmap of Exact Match Ratio by model and label for the MCQA task, illustrating performance differences across reasoning categories in the critical appraisal of scientific articles. Labels correspond to distinct cognitive skills required to answer the questions, as described in \ref{tab:reasoning_skills}.