CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

Paul Grundmann, Dennis Fast, Jan Frick, Thomas Steffek, Felix Gers, Wolfgang Nejdl, Alexander Löser

TL;DR

CliniBench provides a systematic benchmark to compare encoder-based classifiers and generative LLMs for predicting discharge diagnoses from admission notes in MIMIC-IV. The study shows encoder-based models consistently outperform generative LLMs in zero-shot settings, while retrieval augmentation and instruction-based prompts can partly elevate LLM performance. It also offers a comprehensive error analysis highlighting issues such as output redundancy, irrelevant content, and sensitivity to input length, along with methodological and ethical considerations for deploying LLMs in clinical settings. The benchmark thus establishes a framework to evaluate and close the gap between traditional encoders and generative models for clinical decision support, guiding future research on retrieval strategies, domain adaptation, and verification mechanisms.

Abstract

With their growing capabilities, generative large language models (LLMs) are increasingly being investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables a direct comparison of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.

Paper Structure

This paper contains 70 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Architecture diagram of the CliniBench benchmark. Depicted are both model classes: Generative LLM (top) and encoder-based classifier (bottom). In addition to zero-shot evaluation (1), the LLM can be augmented via retrieval augmentation (2) and chain-of-thought (CoT) prompting (3) and predicts text (valid JSON) that is mapped to distinct labels. The encoder (bottom) predicts a probability distribution over the possible diagnoses (4) which can be threshold-tuned to improve performance (5).
  • Figure 2: Macro $F_1$ (averaged over all datasets) of majority voting, the best encoder, and the best generative LLM with varying numbers of demonstrations. Categories on the x-axis show the different retrieval strategies used in the few-shot experiments. Horizontal lines depict the performance of the best zero-shot generative model and the best encoder models.
  • Figure 3: Macro recall and precision by model, with classes grouped into tertiles by class frequency and aggregated over the ICD-9 and ICD-10 splits. The diagram shows the icu (top) and hosp (bottom) splits.
  • Figure 4: Average number of predicted codes after deduplication for zero-shot and few-shot experiments. Small LLMs are represented in shades of red, large ones in shades of blue, with instruction-tuned models highlighted in a shaded pattern.
  • Figure 5: Micro $F_1$ over different sequence lengths, averaged over all dataset splits. Encoder models show drastically reduced performance on sequences longer than 850 tokens, while the generative models tend to perform more consistently on longer sequences. The token count varies across models due to differences in tokenizers.
  • ...and 6 more figures
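
The caption of Figure 1 says the generative LLM emits valid JSON that is mapped to distinct labels, and Figures 2 and 5 report macro and micro $F_1$. The following is a minimal sketch of how such a mapping and scoring step could look; it is not the authors' implementation. The JSON key `diagnoses`, the function names `parse_predicted_codes` and `f1_scores`, and the choice to treat unparseable output as an empty prediction are all assumptions made for illustration.

```python
import json


def parse_predicted_codes(raw_output, valid_codes):
    """Map a raw LLM response to a deduplicated list of known codes.

    Assumes (hypothetically) a response shaped like {"diagnoses": ["I10", ...]}.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # assumption: invalid JSON counts as "no prediction"
    if not isinstance(parsed, dict):
        return []
    codes = parsed.get("diagnoses", [])
    if not isinstance(codes, list):
        return []
    kept = []
    for code in codes:
        if not isinstance(code, str):
            continue
        normalized = code.strip().upper()
        # Drop codes outside the label set and repeated codes.
        if normalized in valid_codes and normalized not in kept:
            kept.append(normalized)
    return kept


def f1_scores(gold, predicted, labels):
    """Return (micro_f1, macro_f1) for multi-label predictions.

    gold and predicted are parallel lists of sets of codes, one per patient.
    """
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    total_tp = total_fp = total_fn = 0
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, predicted) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, predicted) if label in g and label not in p)
        total_tp += tp
        total_fp += fp
        total_fn += fn
        per_label.append(f1(tp, fp, fn))
    micro = f1(total_tp, total_fp, total_fn)  # pools counts over all labels
    macro = sum(per_label) / len(per_label) if per_label else 0.0  # mean of per-label F1
    return micro, macro
```

Micro $F_1$ pools true and false positives across all labels, so frequent diagnoses dominate it, whereas macro $F_1$ averages per-label scores and weights rare diagnoses equally. That difference is one reason the two metrics can rank models differently on long-tailed code distributions.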