
Evaluation and LLM-Guided Learning of ICD Coding Rationales

Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic

Abstract

ICD coding is the process of mapping unstructured text from Electronic Health Records (EHRs) to standardised codes defined by the International Classification of Diseases (ICD) system. In order to promote trust and transparency, existing work on the explainability of ICD coding models primarily relies on attention-based rationales and qualitative assessments conducted by physicians, yet lacks a systematic evaluation across diverse types of rationales using consistent criteria and high-quality rationale-annotated datasets specifically designed for the ICD coding task. Moreover, dedicated methods explicitly trained to generate plausible rationales remain scarce. In this work, we present evaluations of the explainability of rationales in ICD coding, focusing on two fundamental dimensions: faithfulness and plausibility, that is, how rationales influence model decisions and how convincing humans find them. For plausibility, we construct a novel, multi-granular rationale-annotated ICD coding dataset, based on the MIMIC-IV database and the updated ICD-10 coding system. We conduct a comprehensive evaluation across three types of ICD coding rationales: entity-level mentions automatically constructed via entity linking, LLM-generated rationales, and rationales based on attention scores of ICD coding models. Building upon the strong plausibility exhibited by LLM-generated rationales, we further leverage them as distant supervision signals to develop rationale learning methods. Additionally, by prompting the LLM with few-shot human-annotated examples from our dataset, we achieve notable improvements in the plausibility of rationale generation in both the teacher LLM and the student rationale learning models.

Paper Structure

This paper contains 68 sections, 7 equations, 11 figures, 34 tables.

Figures (11)

  • Figure 1: Faithfulness testing workflow. Sufficiency and comprehensiveness are evaluated by retaining or removing rationales from the original documents and using the modified texts as inputs to the trained ICD coding models.
  • Figure 2: Statistics of code and annotation frequencies for the top 10 codes in RD-IV-10 and MDACE. RD-IV-10 provides richer annotations for each label than MDACE.
  • Figure 3: Examples of three types of rationales evaluated for plausibility. Unsup. and Sup. denote Unsupervised and Supervised, respectively. Appr. indicates Approach.
  • Figure 4: Faithfulness results of ICD coding models on four MIMIC datasets. The y-axis represents the decrease ratio, computed from Precision@N scores (N = 8 for the Full set and N = 5 for the Top-50 set) as $(\mathrm{S}_{\mathrm{orig}} - \mathrm{S}_{\mathrm{com/suff}})/\mathrm{S}_{\mathrm{orig}} \times 100\%$, where $\mathrm{S}_{\mathrm{orig}}$ denotes the performance obtained with the original input and $\mathrm{S}_{\mathrm{com/suff}}$ denotes the performance obtained when the input is modified by removing or retaining the rationales. The x-axis denotes the number of most-attended tokens selected. $\uparrow$ denotes higher is better; $\downarrow$ denotes lower is better.
  • Figure 6: Overlap statistics between the MIMIC-III ICD-9 code set and the mapped ICD-9 code set in MDACE.
  • ...and 6 more figures
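
The faithfulness test described in the captions of Figures 1 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`precision_at_n`, `split_by_rationale`, `decrease_ratio`) are hypothetical, rationales are assumed to be given as token indices, and "removing" a rationale is assumed to mean deleting its tokens (the paper may mask them instead).

```python
from typing import List, Sequence, Set, Tuple


def precision_at_n(scores: Sequence[float], gold: Set[int], n: int) -> float:
    """Fraction of the n highest-scoring label indices that are gold labels."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(1 for i in top if i in gold) / n


def split_by_rationale(
    tokens: Sequence[str], rationale_idx: Sequence[int]
) -> Tuple[List[str], List[str]]:
    """Build the two modified inputs used for faithfulness testing.

    Returns (sufficiency_input, comprehensiveness_input):
    the first retains only rationale tokens, the second removes them.
    Deleting tokens (rather than masking) is an assumption of this sketch.
    """
    keep = set(rationale_idx)
    sufficiency = [t for i, t in enumerate(tokens) if i in keep]
    comprehensiveness = [t for i, t in enumerate(tokens) if i not in keep]
    return sufficiency, comprehensiveness


def decrease_ratio(s_orig: float, s_mod: float) -> float:
    """(S_orig - S_com/suff) / S_orig * 100, as in the Figure 4 caption."""
    if s_orig == 0:
        raise ValueError("S_orig must be non-zero")
    return (s_orig - s_mod) / s_orig * 100.0


# Toy usage: a five-token note whose last three tokens form the rationale.
tokens = ["pt", "has", "acute", "renal", "failure"]
suff_input, comp_input = split_by_rationale(tokens, [2, 3, 4])

# Hypothetical label scores for four codes; codes 0 and 3 are gold.
p_at_2 = precision_at_n([0.9, 0.1, 0.8, 0.3], {0, 3}, n=2)

# Hypothetical Precision@N before and after modifying the input.
ratio = decrease_ratio(0.8, 0.6)
```

In practice, each modified input would be fed back to the trained ICD coding model, and the resulting Precision@N would be passed to `decrease_ratio` as the modified score, once for the sufficiency input and once for the comprehensiveness input.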