MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu

TL;DR

This work targets a critical gap in medical LLMs: accurate quantitative calculation for clinical decision support. It introduces MedCalc-Eval, the largest benchmark for medical calculation (700+ tasks spanning equation-based calculations and rule-based scoring systems), to stress-test numerical reasoning across diverse specialties, and MedCalc-Env, a reinforcement learning environment built on InternBootcamp, to train multi-step clinical reasoning. Fine-tuning Qwen2.5-32B within MedCalc-Env achieves state-of-the-art results on both MedCalc-Eval and MedCalc-Bench, with notable gains in numerical sensitivity, formula selection, and reasoning robustness, though challenges remain in unit conversion and multi-condition logic. The work also provides detailed error analyses and discusses future directions, including dynamic patient simulations, multi-modal inputs, explainability, and cross-lingual generalization, to advance dependable AI-powered clinical decision support.

Abstract

As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
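
To make the two task types concrete, here is a minimal Python sketch of two equation-based calculators (BMI and Cockcroft-Gault) and one rule-based score (Glasgow Coma Scale). The formulas are the standard clinical ones. The function names, parameter names, and input validation are illustrative assumptions and are not taken from the MedCalc-Eval code.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Equation-based: body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Equation-based: estimated creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The standard formula multiplies by 0.85 for female patients.
    return crcl * 0.85 if female else crcl


def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based: sum of three component scores, giving a total of 3 to 15."""
    # Allowed component ranges: eye 1-4, verbal 1-5, motor 1-6.
    limits = {"eye": (eye, 4), "verbal": (verbal, 5), "motor": (motor, 6)}
    for name, (value, upper) in limits.items():
        if not 1 <= value <= upper:
            raise ValueError(f"{name} score must be between 1 and {upper}")
    return eye + verbal + motor
```

For instance, `cockcroft_gault(60, 72, 1.0, female=False)` evaluates to 80.0 mL/min, and `glasgow_coma_scale(4, 5, 6)` returns 15. The equation-based functions depend on choosing the right formula and units, while the rule-based one depends on mapping clinical findings to discrete component scores.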

Paper Structure

This paper contains 42 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of the MedCalc-Env training framework and the MedCalc-Eval evaluation process. The left side illustrates the reinforcement learning-based MedCalc-Env training loop: (1) The Case Generator samples from the task database to create clinical calculation cases with ground truth answers; (2) The Prompt Function formats the case into an input for the LLM; (3) The LLM generates reasoning steps and a final answer; (4) The Verification Function compares the LLM's answer with the ground truth to generate a reward signal; and (5) The RL algorithm uses this reward signal to update the LLM's model weights. This cycle repeats continuously to enhance the model's capabilities. The right side shows the evaluation process: after full training, the final model is tested on the independent and comprehensive MedCalc-Eval benchmark to objectively measure its final performance and generalization ability on medical calculation tasks.
  • Figure 2: Top 10 categories in MedCalc-Eval
  • Figure 3: Performance of LLMs on MedCalc-Eval compared to MedCalc-Bench
  • Figure 4: The complete model response for the scale-based question case study.
  • Figure 5: The complete model response for the formula-based question case study.
  • ...and 2 more figures
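
The training loop described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the MedCalc-Env implementation: it does not use the InternBootcamp API, and every name here (`Case`, `case_generator`, `prompt_func`, `verify`, `stub_model`), the BMI-only task, and the 1% relative tolerance are hypothetical. The RL weight update (step 5) is omitted; the sketch stops at producing the reward signal.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Case:
    weight_kg: int
    height_cm: int
    answer: float  # ground-truth BMI in kg/m^2


def case_generator(rng: random.Random) -> Case:
    """Step 1: sample a clinical calculation case with its ground truth."""
    weight_kg = rng.randint(45, 120)
    height_cm = rng.randint(150, 200)
    answer = weight_kg / (height_cm / 100) ** 2
    return Case(weight_kg, height_cm, answer)


def prompt_func(case: Case) -> str:
    """Step 2: format the case as an input for the model."""
    return (
        f"A patient weighs {case.weight_kg} kg and is {case.height_cm} cm tall. "
        "Compute the BMI in kg/m^2 and finish with 'Answer: <number>'."
    )


def verify(response: str, case: Case, rel_tol: float = 0.01) -> float:
    """Step 4: compare the extracted answer with the ground truth.

    Returns a reward of 1.0 or 0.0. The tolerance is an assumed value.
    """
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    if match is None:
        return 0.0
    predicted = float(match.group(1))
    within = abs(predicted - case.answer) <= rel_tol * abs(case.answer)
    return 1.0 if within else 0.0


def stub_model(prompt: str, case: Case) -> str:
    """Step 3 stand-in: a fake 'model' that always answers correctly."""
    return f"BMI = weight / height^2. Answer: {case.answer:.2f}"


rng = random.Random(0)
rewards = []
for _ in range(5):
    case = case_generator(rng)
    response = stub_model(prompt_func(case), case)
    rewards.append(verify(response, case))
```

In an actual setup, `stub_model` would be replaced by the LLM being trained, and the rewards would be passed to an RL algorithm that updates the model weights.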