ACE-ICD: Acronym Expansion As Data Augmentation For Automated ICD Coding

Tuan-Dung Le, Shohreh Haddadan, Thanh Q. Thieu

TL;DR

ACE-ICD addresses the pervasive use of acronyms in clinical notes to improve ICD coding. It introduces acronym expansion as data augmentation, performed by prompting open-source LLMs, together with a KL-divergence-based consistency training objective that aligns predictions between original and acronym-expanded notes. The objective is formalized as $\mathcal{L} = \frac{1}{2}(\mathcal{L}_{ce}(P, y) + \mathcal{L}_{ce}(P_{a}, y)) + \alpha \mathcal{L}_{cons}(P, P_{a}, y)$ with $\mathcal{L}_{cons}=\frac{1}{2}(KL[p(y|P)||p(y|P_a)] + KL[p(y|P_a)||p(y|P)])$, where $P$ denotes the original note, $P_a$ its acronym-expanded counterpart, $y$ the gold codes, and $\alpha$ the weight of the consistency term. Evaluated on MIMIC-III across common-code, rare-code, and full-code settings, ACE-ICD achieves new state-of-the-art results, which the authors attribute to robust acronym expansion with large open-source LLMs and to consistency regularization. Because the LLMs run locally, clinical notes never leave the local environment, which preserves privacy. The method improves performance on low-frequency codes in particular, and it reduces reliance on external annotations and proprietary resources by using zero-shot acronym disambiguation in a lightweight preprocessing pipeline.
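The objective above can be made concrete with a small sketch. It is not the authors' implementation. It assumes that ICD coding is treated as multi-label classification, so each code gets an independent sigmoid probability, that $\mathcal{L}_{ce}$ is binary cross-entropy averaged over codes, and that each KL term is the Bernoulli KL divergence averaged over codes. The function names (`bce`, `bernoulli_kl`, `ace_icd_loss`) and the example logits are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _clip(p, eps=1e-12):
    # Keep probabilities away from 0 and 1 so the logs stay finite.
    return min(max(p, eps), 1.0 - eps)

def bce(probs, labels):
    """Binary cross-entropy averaged over codes (assumed form of L_ce)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = _clip(p)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def bernoulli_kl(p, q):
    """KL[Bern(p) || Bern(q)] averaged over codes (assumed per-code form)."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = _clip(a), _clip(b)
        total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total / len(p)

def ace_icd_loss(logits_orig, logits_expanded, labels, alpha=1.0):
    """L = 1/2 (L_ce(P, y) + L_ce(P_a, y)) + alpha * L_cons."""
    p = [sigmoid(z) for z in logits_orig]        # predictions on original note
    p_a = [sigmoid(z) for z in logits_expanded]  # predictions on expanded note
    l_ce = 0.5 * (bce(p, labels) + bce(p_a, labels))
    # Symmetric KL: average of both directions.
    l_cons = 0.5 * (bernoulli_kl(p, p_a) + bernoulli_kl(p_a, p))
    return l_ce + alpha * l_cons

# Hypothetical logits for three codes on one note and its expanded version.
labels = [1, 0, 1]
loss = ace_icd_loss([2.0, -1.0, 0.5], [1.5, -0.5, 1.0], labels, alpha=1.0)
print(f"loss = {loss:.4f}")
```

When the two views produce identical predictions, the consistency term vanishes and the loss reduces to the ordinary cross-entropy. Any disagreement between the views adds a non-negative penalty scaled by `alpha`.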

Abstract

Automatic ICD coding, the task of assigning disease and procedure codes to electronic medical records, is crucial for clinical documentation and billing. While existing methods primarily enhance model understanding of code hierarchies and synonyms, they often overlook the pervasive use of medical acronyms in clinical notes, a key factor in ICD code inference. To address this gap, we propose a novel and effective data augmentation technique that leverages large language models to expand medical acronyms, allowing models to be trained on their full-form representations. Moreover, we incorporate consistency training to regularize predictions by enforcing agreement between the original and augmented documents. Extensive experiments on the MIMIC-III dataset demonstrate that our approach, ACE-ICD, establishes new state-of-the-art performance across multiple settings, including common codes, rare codes, and full-code assignments. Our code is publicly available.

Paper Structure

This paper contains 19 sections, 2 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Our training pipeline incorporating acronym-expanded data augmentation and consistency training.
  • Figure 2: F1 improvement per code in the MIMIC-III-50 dataset (codes sorted by number of training examples in descending order).