MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis

Yuexin Wu, Shiqi Wang, Vasile Rus

TL;DR

The paper introduces MIMIC-SR-ICD11, a large English diagnostic dataset that converts de-identified EHR discharge notes into first-person patient self-reports aligned to ICD-11. It also presents LL-Rank, a PMI-style re-ranking method that ranks diagnoses by combining a conditional likelihood with a corpus-derived prior, improving over the GenMap baseline. Across seven model backbones, LL-Rank achieves substantial gains, particularly for long-tail diagnoses, with the PMI coefficient $\alpha$ near 1 yielding the best performance. By providing a realistic narrative-driven dataset and a principled scoring approach for diagnostic ranking, the work enables scalable pretraining and domain adaptation in clinical NLP, under clear licensing and ethical constraints.

Abstract

Disease diagnosis is a central pillar of modern healthcare, enabling early detection and timely intervention for acute conditions while guiding lifestyle adjustments and medication regimens to prevent or slow chronic disease. Self-reports preserve clinically salient signals that templated electronic health record (EHR) documentation often attenuates or omits, especially subtle but consequential details. To operationalize this shift, we introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and natively aligned to WHO ICD-11 terminology. We further present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context and subtracts the corresponding report-free prior likelihood for that label. Across seven model backbones, LL-Rank consistently outperforms a strong generation-plus-mapping baseline (GenMap). Ablation experiments show that LL-Rank's gains primarily stem from its PMI-based scoring, which isolates semantic compatibility from label frequency bias.
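
The LL-Rank score described above can be illustrated with a minimal sketch. It assumes that per-token log-probabilities of each candidate label are already available, once conditioned on the self-report and once without it. The function names, the toy labels, and the numbers are hypothetical and chosen for illustration only; the weighting by a coefficient `alpha` follows the PMI coefficient named in Figure 1, and the exact formulation in the paper may differ.

```python
from typing import Dict, List, Sequence, Tuple


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood: the average per-token log-prob."""
    if not token_logprobs:
        raise ValueError("a label must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def ll_rank(
    cond_logprobs: Dict[str, List[float]],
    prior_logprobs: Dict[str, List[float]],
    alpha: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rank labels by (conditional score) - alpha * (report-free prior score).

    cond_logprobs:  label -> token log-probs given the self-report (assumed input)
    prior_logprobs: label -> token log-probs without the report (assumed input)
    """
    scores = {
        label: mean_logprob(cond_logprobs[label])
        - alpha * mean_logprob(prior_logprobs[label])
        for label in cond_logprobs
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values: a frequent label with a high prior and a rare
# label that the report supports more strongly than its prior suggests.
cond = {"common_dx": [-0.5, -0.5], "rare_dx": [-1.0, -1.0]}
prior = {"common_dx": [-0.25, -0.25], "rare_dx": [-2.0, -2.0]}

likelihood_only = ll_rank(cond, prior, alpha=0.0)
pmi_style = ll_rank(cond, prior, alpha=1.0)
```

With `alpha=0.0` the ranking reduces to the conditional likelihood alone and favors the frequent label; with `alpha=1.0` the prior is subtracted in full and the rare label moves to the top. This is the sense in which a PMI-style score separates compatibility with the report from label frequency.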

Paper Structure

This paper contains 42 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Effect of the PMI coefficient ($\alpha$) on LL-Rank scoring.
  • Figure 2: Data construction pipeline. Left branch: extract primary diagnoses from MIMIC-IV and map ICD-9 to ICD-10 then ICD-10 to ICD-11, with one-to-one filtering and manual quality control. Right branch: rewrite MIMIC-IV-Note into first-person self-reports using ChatGPT. The dashed box shows the exact prompt we used in practice. Outputs: final ICD-11 labels and paired records.
  • Figure 3: Illustration of the standardized prompt format (instruction, candidate list, and patient self-report) provided to the general LLMs.
  • Figure 4: Prompt used for evaluation.
  • Figure 5: Distribution of clinical specialties in the primary-diagnosis subset.
  • ...and 1 more figures
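
The label-mapping branch in the Figure 2 caption (ICD-9 to ICD-10, then ICD-10 to ICD-11, with one-to-one filtering) can be sketched as follows. This is a minimal illustration under the assumption that "one-to-one" means each hop has exactly one target code; the caption does not spell out the rule, and the manual quality-control step is not modeled. The mapping tables and the codes in them are hypothetical placeholders, not real ICD entries.

```python
from typing import Dict, List, Optional


def map_to_icd11(
    icd9_code: str,
    icd9_to_10: Dict[str, List[str]],
    icd10_to_11: Dict[str, List[str]],
) -> Optional[str]:
    """Chain ICD-9 -> ICD-10 -> ICD-11, keeping only unambiguous mappings.

    Returns None when either hop is missing or maps to more than one code
    (assumed interpretation of "one-to-one filtering").
    """
    icd10_targets = icd9_to_10.get(icd9_code, [])
    if len(icd10_targets) != 1:
        return None
    icd11_targets = icd10_to_11.get(icd10_targets[0], [])
    if len(icd11_targets) != 1:
        return None
    return icd11_targets[0]


# Hypothetical placeholder tables (not real ICD codes).
icd9_to_10 = {"I9-A": ["I10-A"], "I9-B": ["I10-B", "I10-C"]}
icd10_to_11 = {"I10-A": ["I11-A"], "I10-B": ["I11-B"], "I10-C": ["I11-C"]}

kept = map_to_icd11("I9-A", icd9_to_10, icd10_to_11)
dropped = map_to_icd11("I9-B", icd9_to_10, icd10_to_11)
```

Here `I9-A` survives because both hops are unambiguous, while `I9-B` is filtered out because it maps to two ICD-10 codes.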