Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation

Zhangdie Yuan, Han-Chin Shing, Mitch Strong, Chaitanya Shivade

TL;DR

This work addresses accurate ICD-10-CM clinical coding. It shows that exact-match metrics overlook hierarchical near-miss errors, where the predicted code is close to the correct one in the ICD-10-CM hierarchy but still wrong, and it proposes lightweight remedies. The authors introduce a generate-expand-verify pipeline that uses the ICD-10-CM structure to expand candidate codes and then revises the prediction in the context of the clinical note; prompt engineering and small-scale fine-tuning improve the generation step. To address limitations of existing datasets, they release an expert double-annotated benchmark of outpatient notes with ICD-10-CM codes. Verification substantially improves end-to-end accuracy, with notable gains for stronger models. Together with structured prompts and modest fine-tuning, it can make LLM-based clinical coding more reliable and more practical for hospitals, although careful clinical validation, privacy safeguards, and human oversight remain necessary before deployment.

Abstract

Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.
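To make the gap between exact-match scoring and hierarchical near misses concrete, the sketch below separates predictions into exact matches, near misses, and misses. It is a minimal illustration, not the paper's metric: the names `normalize`, `classify`, and `tally` are hypothetical, and "near miss" is approximated here as sharing the three-character ICD-10-CM category with the gold code, which is an assumption made only for this example.

```python
from collections import Counter
from typing import Dict, List


def normalize(code: str) -> str:
    """Strip the dot and upper-case the code so 'e11.21' and 'E1121' compare equal."""
    return code.replace(".", "").strip().upper()


def classify(pred: str, gold: str) -> str:
    """Label one prediction as 'exact', 'near_miss', or 'miss'.

    Assumption for this sketch: a near miss is a wrong code that shares
    the gold code's three-character category.
    """
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return "exact"
    if p[:3] == g[:3]:
        return "near_miss"
    return "miss"


def tally(preds: List[str], golds: List[str]) -> Dict[str, int]:
    """Count each outcome over aligned prediction/gold pairs."""
    counts = Counter(classify(p, g) for p, g in zip(preds, golds))
    return {label: counts.get(label, 0) for label in ("exact", "near_miss", "miss")}


if __name__ == "__main__":
    preds = ["E11.21", "E11.22", "I10"]
    golds = ["E11.21", "E11.21", "E11.9"]
    print(tally(preds, golds))
```

An exact-match metric would report only the `exact` count; the `near_miss` bucket is the portion of errors it cannot distinguish from unrelated misses.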

Paper Structure

This paper contains 39 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: An illustration of our generate-expand-verify pipeline. In this obfuscated example, the model-predicted code pairs the correct description with the wrong ICD-10-CM code. The expansion step uses the ICD-10-CM tabular list to look up the predicted code's siblings. The verification step then selects the correct code and description based on the clinical notes and the expansion candidates.
  • Figure 2: Example fragment of the ICD-10-CM hierarchy, adapted from yih2023broad. Only leaf nodes are billable. Nodes with the same parent are considered siblings; those with the same grandparent are cousins.
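
The sibling and cousin relations described in the two captions can be sketched over a small parent map. This is an illustrative sketch only: the `PARENT` table is a toy fragment rather than the full ICD-10-CM tabular list, the function names are hypothetical, cousins are taken here to exclude siblings, and the optional cousin expansion is an assumption (the Figure 1 caption mentions only siblings).

```python
from typing import Dict, List, Optional

# Toy child -> parent map. Illustrative fragment only, not the full tabular list.
PARENT: Dict[str, Optional[str]] = {
    "E11": None,
    "E11.2": "E11",
    "E11.6": "E11",
    "E11.21": "E11.2",
    "E11.22": "E11.2",
    "E11.29": "E11.2",
    "E11.65": "E11.6",
    "E11.69": "E11.6",
}


def children(node: str) -> List[str]:
    """Return the direct children of a node, sorted for stable output."""
    return sorted(c for c, p in PARENT.items() if p == node)


def is_leaf(code: str) -> bool:
    """In this toy map a leaf is a node with no children."""
    return not children(code)


def siblings(code: str) -> List[str]:
    """Nodes that share a parent with `code`, excluding `code` itself."""
    parent = PARENT.get(code)
    if parent is None:
        return []
    return [c for c in children(parent) if c != code]


def cousins(code: str) -> List[str]:
    """Nodes that share a grandparent with `code` but have a different parent."""
    parent = PARENT.get(code)
    grandparent = PARENT.get(parent) if parent is not None else None
    if grandparent is None:
        return []
    found: List[str] = []
    for uncle in children(grandparent):
        if uncle != parent:
            found.extend(children(uncle))
    return found


def expand(code: str, include_cousins: bool = False) -> List[str]:
    """Candidate set for verification: the code plus its leaf relatives."""
    candidates = [code] + siblings(code)
    if include_cousins:
        candidates += cousins(code)
    return [c for c in candidates if is_leaf(c)]


if __name__ == "__main__":
    print(expand("E11.21"))
    print(expand("E11.21", include_cousins=True))
```

A verifier would then choose among the returned candidates using the clinical note; that selection step is not shown because it depends on the model being used.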