DeVisE: Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto

TL;DR

DeVisE introduces a clinically grounded behavioral testing framework for medical large language models by constructing controlled counterfactuals for Demographics and Vital signs in ICU admission notes derived from the MIMIC-IV dataset. The framework evaluates models in zero-shot mortality and ICU length-of-stay tasks using both task-specific (downstream prediction shifts, directionality, and monotonicity) and task-independent (perplexity-based) metrics, revealing substantial model differences that standard benchmarks miss. Across eight diverse LLMs, template-based notes amplify sensitivity to vital-sign perturbations and expose demographic biases, while clinical-domain models tend to be more conservative in their reasoning. The study demonstrates that incorporating counterfactual behavioral testing provides deeper insights into clinical reasoning, robustness, and fairness, offering a path toward safer deployment of medical LLMs.

Abstract

Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
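The input-level sensitivity measure can be illustrated with a minimal sketch. It assumes perplexity is computed as the exponential of the negative mean per-token natural-log probability, and that the shift is the counterfactual perplexity minus the original perplexity; the exact definition used in the paper is not given here, and the function names and log-probability values below are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def delta_ppl(original_logprobs, counterfactual_logprobs):
    """Assumed definition: counterfactual perplexity minus original perplexity."""
    return perplexity(counterfactual_logprobs) - perplexity(original_logprobs)


# Hypothetical per-token log-probabilities for one note and its counterfactual.
original = [-0.5, -0.5, -0.5, -0.5]
counterfactual = [-1.0, -1.0, -1.0, -1.0]

shift = delta_ppl(original, counterfactual)
print(f"Delta PPL: {shift:.4f}")
```

A positive value means the model finds the counterfactual note less expected than the original; averaging such shifts over many note pairs would give one point on a severity-shift curve.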

Paper Structure

This paper contains 47 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of DeVisE in 5 steps: (1) We create admission notes using MIMIC-IV discharge summaries; (2) We create counterfactuals for template-based and raw notes; (3-4) We use original and counterfactual notes to predict mortality and LoS in a zero-shot setting; and (5) We evaluate and compare LLMs in (5a) a task-specific setting, measuring how counterfactuals affect mortality and LoS probabilities, and in (5b) a task-independent setting, comparing how LLM perplexity changes between original and counterfactual examples.
  • Figure 2: Average per-token $\Delta\texttt{PPL}$ across counterfactual severity shifts. $\Delta\texttt{PPL}$ grows with both increasing and decreasing severity, indicating consistent linguistic sensitivity. In (a), GPT-OSS-120B was excluded due to its substantially higher $\Delta\texttt{PPL}$ ($2.5\pm5.1$). The same pattern appears in (a) and (b), with larger effects in template notes; the exception is GPT-OSS-120B, which inversely shows a smaller response in raw notes. GPT-4.1-mini is omitted because we could not obtain per-token log probabilities for this model.
  • Figure 3: Average KL across counterfactual severity shifts for LoS task (raw notes). Models show different behaviours. GPT-4.1-mini, GPT-OSS-120B and LLaMA-3.3-Instruct-70B show larger KL shifts than other models.
  • Figure 4: Change in expected LoS, $\Delta \mathbb{E}(y_i^\text{los}|\cdot)$, in hours (a) and in probability of mortality, $\Delta \mathbb{E}(y_i^\text{mort}|\cdot)$ (b), as a function of counterfactual severity shift. Positive severity shifts are expected to increase predicted LoS and mortality risk, while negative shifts are expected to decrease them. Most models follow a monotonic trend, indicating clinically aligned responses to vital sign counterfactuals. LLaMA-3.3-Instruct-70B was omitted because its substantially larger response dominates the scale; see the additional results in the appendix.
  • Figure 5: $\Delta \mathbb{E}(y_i^\text{los}|\cdot)$ and Flips (%) by demographics. Age produces the largest and most significant shifts, while race and gender effects are smaller but consistent. Blank cells denote t-test comparisons that are not significant at $p<0.05$.
  • ...and 2 more figures
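
The task-specific quantities named in the captions above (KL shift, change in expected LoS, monotonicity, and flip percentage) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the LoS bins and their representative hours, the direction of the KL divergence, the non-strict monotonicity check, and all numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def expected_value(probs, values):
    """Expectation of a discrete prediction, e.g. expected LoS in hours."""
    return sum(p * v for p, v in zip(probs, values))


def is_monotonic_non_decreasing(xs):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def flip_rate(original_labels, counterfactual_labels):
    """Percentage of examples whose predicted label changes under the counterfactual."""
    flips = sum(o != c for o, c in zip(original_labels, counterfactual_labels))
    return 100.0 * flips / len(original_labels)


# Hypothetical LoS bins, each represented by a single value in hours.
los_hours = [24, 72, 168]
p_original = [0.5, 0.3, 0.2]        # predicted distribution for the original note
p_counterfactual = [0.2, 0.3, 0.5]  # prediction after a higher-severity vital sign

kl_shift = kl_divergence(p_original, p_counterfactual)
delta_los = expected_value(p_counterfactual, los_hours) - expected_value(p_original, los_hours)

# Hypothetical mean shifts in expected LoS for severity shifts -2, -1, 0, +1, +2.
shifts_by_severity = [-30.0, -12.0, 0.0, 15.0, delta_los]
monotonic = is_monotonic_non_decreasing(shifts_by_severity)

flips = flip_rate([0, 0, 1, 1], [0, 1, 1, 0])
print(f"KL={kl_shift:.4f} nats, delta LoS={delta_los:.1f} h, monotonic={monotonic}, flips={flips:.0f}%")
```

Under these assumptions, a clinically aligned model would show a positive change in expected LoS for positive severity shifts and a non-decreasing sequence across the severity axis, while the flip percentage captures how often a perturbation changes the predicted label outright.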