ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks

Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Miguel O. Bernabeu, Yasha Wang, Lequan Yu, Chengwei Pan, Ewen M. Harrison, Liantao Ma

TL;DR

ClinicRealm provides a rigorous benchmark showing that modern large language models can rival or exceed specialized approaches for non-generative clinical prediction, especially with unstructured notes. Across mortality, readmission, and length-of-stay (LOS) tasks, zero-shot and data-efficient prompting yield strong performance, and open-source LLMs sometimes match or surpass proprietary models, which makes on-premise deployment more accessible. Multimodal integration yields mixed results, highlighting how hard it is to combine structured EHR data with clinical narratives without task-specific finetuning. The study also examines fairness and trustworthiness: zero-shot prompting shows promising fairness signals, and the authors identify failure modes and deployment considerations. Overall, the work advocates a task- and data-driven model selection strategy that uses LLMs alongside conventional approaches to improve predictive healthcare while preserving data privacy and clinical interpretability.

Abstract

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to specialized models, remains under-evaluated. The lack of systematic benchmarking has led to ongoing debate within the field and creates potential for misuse, misunderstanding, or over-reliance. Our ClinicRealm study addresses this by benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), while also assessing their reasoning, reliability, and fairness. Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) in zero-shot settings now decisively outperform finetuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-5, DeepSeek-V3.1-Think) show potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs can match or exceed proprietary counterparts. These results provide compelling evidence that modern LLMs are competitive tools for non-generative clinical prediction, particularly on unstructured text, and that they offer a data-efficient option for structured data, which calls for a re-evaluation of model selection strategies. These findings should inform medical informaticists, AI developers, and clinical researchers, potentially prompting a reassessment of current assumptions and inspiring new approaches to LLM application in predictive healthcare.
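
To make the zero-shot setting on structured EHR data concrete, the following is a minimal sketch of how a record might be serialized into a prompt and how a probability might be read back from a model's reply. It is not the prompt or parsing logic used in the study: the feature names, the prompt wording, and the helpers `build_zero_shot_prompt` and `parse_probability` are all hypothetical, and no LLM is actually called.

```python
import re


def build_zero_shot_prompt(features, task="in-hospital mortality"):
    """Serialize structured EHR values into a plain-text zero-shot prompt.

    Hypothetical template for illustration; not the study's prompt.
    """
    record = "\n".join(f"- {name}: {value}" for name, value in features.items())
    parts = [
        "You are assisting with a clinical prediction task.",
        f"Task: estimate the probability of {task} for this patient.",
        "Patient record:",
        record,
        "Answer with a single number between 0 and 1.",
    ]
    return "\n".join(parts)


# Matches a standalone number in [0, 1], e.g. "0", "0.27", ".5", "1", "1.0".
# The lookarounds reject digits that are part of a larger number such as "12.5".
_PROBABILITY = re.compile(r"(?<![\d.])(?:0(?:\.\d+)?|1(?:\.0+)?|\.\d+)(?!\d|\.\d)")


def parse_probability(response):
    """Return the first number in [0, 1] found in the reply, or None."""
    match = _PROBABILITY.search(response)
    return float(match.group()) if match else None


if __name__ == "__main__":
    # Hypothetical feature names and values, chosen only for the example.
    patient = {"age": 71, "heart rate (bpm)": 112, "lactate (mmol/L)": 4.2}
    print(build_zero_shot_prompt(patient))
    # A stand-in reply; a real pipeline would send the prompt to a model.
    print(parse_probability("Estimated risk: 0.27"))
```

Because no labeled examples appear in the prompt, this corresponds to the zero-shot case; a few-shot variant would prepend a handful of labeled records. The parser simply takes the first in-range number, so a reply that mentions another small number first would be misread. A stricter output format would be needed in practice.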

Paper Structure (39 sections, 4 figures, 27 tables)

Figures (4)

  • Figure 1: Comparative performance and recommendations for model selection in non-generative clinical tasks. This figure summarizes the benchmarking results, offering guidance on selecting optimal models for different clinical scenarios. It compares conventional DL/ML, BERT-style, and Large Language Models (LLMs) across two categories of tasks: (i) prediction tasks using unstructured clinical notes data (e.g., mortality, readmission prediction), and (ii) prediction tasks using structured Electronic Health Record (EHR) data (e.g., mortality, length-of-stay prediction).
  • Figure 2: Web-based interface for the human evaluation study. Expert evaluators were presented with the patient's clinical data (input), the LLM's generated reasoning and prediction, and the ground truth label. They used the interface to provide scores on a 1-5 Likert scale for three quality dimensions and to select predefined error types from a checklist.
  • Figure 3: Hierarchically structured error taxonomy for LLM-generated clinical reasoning. This taxonomy was developed by two expert clinicians using an open-coding thematic analysis of a pilot set of model outputs. It provided a standardized framework for the detailed error analysis.
  • Figure 4: AUROC performance of CatBoost and XGBoost on MIMIC-IV with varying training set sizes. The plots illustrate the AUROC for mortality (a, b) and readmission (c, d) prediction tasks. The shaded area represents the standard deviation over multiple runs. Performance for both models tends to saturate after the training set size reaches a few hundred samples, demonstrating that increasing data volume provides diminishing returns for these complex and heterogeneous prediction tasks.
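
The kind of curve described in Figure 4 (mean AUROC with a standard-deviation band across runs, as a function of training set size) can be sketched as follows. This is an illustration under stated assumptions, not the paper's experiment: it uses synthetic Gaussian data instead of MIMIC-IV and a simple nearest-centroid scorer as a stand-in for CatBoost and XGBoost, and the helper names (`auroc`, `make_patients`, `fit_centroid_scorer`, `learning_curve`) are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) form; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both classes present")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def make_patients(n, rng, dim=5, shift=0.5, pos_rate=0.3):
    """Synthetic cohort (assumption): positives are shifted Gaussians."""
    n_pos = min(n - 1, max(1, round(pos_rate * n)))  # keep both classes present
    labels = [1] * n_pos + [0] * (n - n_pos)
    rng.shuffle(labels)
    return [([rng.gauss(shift * y, 1.0) for _ in range(dim)], y) for y in labels]


def fit_centroid_scorer(train):
    """Stand-in model: score = projection onto (positive mean - negative mean)."""
    dim = len(train[0][0])
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    w = [
        statistics.fmean([p[d] for p in pos]) - statistics.fmean([q[d] for q in neg])
        for d in range(dim)
    ]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def learning_curve(sizes, n_runs=5, n_test=500):
    """Mean and standard deviation of test AUROC for each training set size."""
    curve = {}
    for n in sizes:
        aucs = []
        for run in range(n_runs):
            rng = random.Random(1000 * n + run)  # one seed per (size, run)
            train = make_patients(n, rng)
            test = make_patients(n_test, rng)
            score = fit_centroid_scorer(train)
            aucs.append(auroc([y for _, y in test], [score(x) for x, _ in test]))
        curve[n] = (statistics.fmean(aucs), statistics.stdev(aucs))
    return curve


if __name__ == "__main__":
    for n, (mean, std) in learning_curve([10, 50, 200, 800]).items():
        print(f"n_train={n:4d}  AUROC={mean:.3f} +/- {std:.3f}")
```

Plotting the mean against `n_train` with a band of one standard deviation on either side gives the same type of figure. Whether and where the curve saturates depends on the data and the model, so nothing here reproduces the reported result.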