Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models

Heejoon Koo

TL;DR

This work evaluates large language models for next-visit diagnosis prediction from noisy clinical notes, focusing on robustness and fairness across demographic subgroups. It introduces NECHO v3, combining a clinically grounded label-reduction mapping with hierarchical chain-of-thought prompting to manage a large diagnostic label space and mimic clinician reasoning. Through a systematic degradation pipeline applied to MIMIC-IV discharge notes, the study reveals stable overall performance but greater subgroup instability and fairness concerns for minority and younger groups under corruption. The findings advocate for structured prompt design and subgroup-aware evaluation to enable safer, more equitable deployment of LLM-based clinical decision support in noisy real-world settings.

Abstract

A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. The impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.

Paper Structure

This paper contains 41 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: An Overview of Our LLM Evaluation Pipeline under Various Clinical Note Degradations.
  • Figure 2: Performance across Racial Groups under Original and Corruption Settings.
  • Figure 3: Top-10 Child-level Clinical Sub-categories across Race Groups (for simplicity, we report results for three groups: White, Hispanic/Latino, and Unknown).
  • Figure 4: Fairness across Racial Groups under Original and Corruption Settings.
  • Figure 5: Top-10 Child-level Clinical Sub-categories across Age Groups (18–40, 41–60, 61+).
  • ...and 3 more figures