DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto
TL;DR
DeVisE introduces a clinically grounded behavioral testing framework for medical large language models by constructing controlled counterfactuals for Demographics and Vital signs in ICU admission notes derived from the MIMIC-IV dataset. The framework evaluates models on zero-shot mortality and ICU length-of-stay prediction using both task-specific metrics (downstream prediction shifts, directionality, and monotonicity) and task-independent, perplexity-based metrics, revealing substantial model differences that standard benchmarks miss. Across eight diverse LLMs, template-based notes amplify sensitivity to vital-sign perturbations and expose demographic biases, while clinical-domain models tend to be more conservative in their reasoning. The study demonstrates that counterfactual behavioral testing provides deeper insight into clinical reasoning, robustness, and fairness, offering a path toward safer deployment of medical LLMs.
Abstract
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
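To make the two analysis levels concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-token natural-log probabilities are already available from some model, and that a template note can be filled by simple string substitution. The function names (make_counterfactual, perplexity, perplexity_shift, is_monotonic), the template wording, and the field names are all hypothetical; the exact metric definitions used in DeVisE are not given in this section.

```python
import math
from typing import Sequence


def make_counterfactual(template: str, **fields) -> str:
    """Fill a template note; changing one field gives a single-variable counterfactual.

    The template text and field names are hypothetical, for illustration only.
    """
    return template.format(**fields)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(original: Sequence[float], counterfactual: Sequence[float]) -> float:
    """Input-level sensitivity (assumed form): counterfactual minus original perplexity."""
    return perplexity(counterfactual) - perplexity(original)


def is_monotonic(predictions: Sequence[float], increasing: bool = True) -> bool:
    """Check that predictions never move against the expected direction
    as the perturbed variable is stepped through ordered values."""
    pairs = list(zip(predictions, predictions[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


# Usage: perturb only the heart rate, keeping every other field fixed.
template = "Patient is a {age}-year-old {gender}. Heart rate: {heart_rate} bpm."
original_note = make_counterfactual(template, age=64, gender="woman", heart_rate=80)
perturbed_note = make_counterfactual(template, age=64, gender="woman", heart_rate=140)
```

In this sketch, a model whose predicted mortality risk rises and then falls as heart rate is stepped through increasingly abnormal values would fail the monotonicity check, even if its accuracy on the unperturbed notes looked acceptable.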
