MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
Kangkun Mao, Jinru Ding, Jiayuan Chen, Mouxiao Bian, Ruiyao Chen, Xinwei Peng, Sijie Ren, Linyang Li, Jie Xu
TL;DR
This work targets a critical gap in medical LLMs: accurate quantitative calculation for clinical decision support. It introduces MedCalc-Eval, the largest benchmark for medical calculation (700+ tasks spanning equation-based calculations and rule-based scoring systems), which stress-tests numerical reasoning across diverse specialties, and MedCalc-Env, a reinforcement learning environment built on InternBootcamp for training multi-step clinical reasoning. Fine-tuning Qwen2.5-32B within MedCalc-Env achieves state-of-the-art results on both MedCalc-Eval and MedCalc-Bench, with notable gains in numerical sensitivity, formula selection, and reasoning robustness, though challenges remain in unit conversion and multi-condition logic. The work also provides detailed error analyses and discusses future directions, including dynamic patient simulations, multi-modal inputs, explainability, and cross-lingual generalization, to advance dependable AI-powered clinical decision support.
Abstract
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
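To make the two task types concrete, the following is a minimal illustrative sketch, not code from MedCalc-Eval or MedCalc-Env. It implements two equation-based calculators (BMI and the Cockcroft-Gault creatinine clearance estimate) and one rule-based score (Glasgow Coma Scale) using their standard published definitions. The function names, parameter names, and input units are assumptions chosen for this example.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Equation-based task: body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Equation-based task: estimated creatinine clearance in mL/min.

    Assumes serum creatinine is already in mg/dL; a value reported in
    umol/L would first need converting, which is exactly the kind of
    unit-handling step the abstract lists as a remaining challenge.
    """
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The standard formula applies a 0.85 correction factor for female patients.
    return clearance * 0.85 if is_female else clearance


def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based task: sum of three component scores (total range 3-15)."""
    limits = {"eye": (eye, 4), "verbal": (verbal, 5), "motor": (motor, 6)}
    for name, (value, maximum) in limits.items():
        if not 1 <= value <= maximum:
            raise ValueError(f"{name} score must be between 1 and {maximum}")
    return eye + verbal + motor


if __name__ == "__main__":
    print(round(bmi(70, 1.75), 1))
    print(round(cockcroft_gault(60, 72, 1.0, is_female=False), 1))
    print(glasgow_coma_scale(eye=4, verbal=5, motor=6))
```

The contrast is the point of the sketch: an equation-based task requires selecting the right formula and supplying correctly converted numeric inputs, whereas a rule-based task requires mapping clinical findings onto discrete component scores before summing them.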
