CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
Paul Grundmann, Dennis Fast, Jan Frick, Thomas Steffek, Felix Gers, Wolfgang Nejdl, Alexander Löser
TL;DR
CliniBench is a systematic benchmark for comparing encoder-based classifiers and generative LLMs on predicting discharge diagnoses from admission notes in MIMIC-IV. The study shows that encoder-based models consistently outperform generative LLMs in zero-shot settings, while retrieval augmentation and instruction-based prompts partly improve LLM performance. It also offers a comprehensive error analysis highlighting issues such as output redundancy, irrelevant content, and sensitivity to input length, along with methodological and ethical considerations for deploying LLMs in clinical settings. The benchmark thus establishes a framework for evaluating and closing the gap between traditional encoders and generative models for clinical decision support, guiding future research on retrieval strategies, domain adaptation, and verification mechanisms.
Abstract
With their growing capabilities, generative large language models (LLMs) are increasingly being investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables a direct comparison of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
