
Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation

Xindi Wang, Robert E. Mercer, Frank Rudzicz

TL;DR

This work addresses automated ICD coding, where the label space is very large and long-tailed and the clinical notes are lengthy. It introduces a multi-stage retrieve and re-rank framework. The retrieval stage combines auxiliary EHR knowledge (DRG codes, CPT codes, medications) with BM25 to produce a compact candidate set. The re-ranking stage then applies a Graphormer-based label encoder over a code co-occurrence graph, trained with a contrastive objective that aligns each note with its true codes. The approach achieves state-of-the-art performance on the MIMIC-III full and top-50 settings, particularly on precision-at-k metrics, and shows the value of exploiting long context and label interdependencies. By narrowing the label space and using external knowledge and code co-occurrence, the method offers a practical approach to automated medical coding, with potential extensions to UMLS and synonymy left to future work.

Abstract

The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches struggle to select the proper label subsets from an extremely large ICD collection with a heavily long-tailed label distribution. In this paper, we propose a multi-stage "retrieve and re-rank" framework as a novel solution to ICD indexing: a hybrid discrete retrieval method first collects candidates, which are then re-ranked with contrastive learning so that the model can make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge from the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes and their positive ICD codes. Experimental results show that the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.
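
The retrieval stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy code descriptions, the `top_k` cutoff, and the choice to merge auxiliary-knowledge codes with BM25 hits by a simple union are all assumptions made here for illustration. BM25 itself is the standard Okapi formulation with the common defaults `k1=1.5` and `b=0.75`.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def retrieve_candidates(note_tokens, code_descriptions, aux_codes, top_k=2):
    """Hypothetical hybrid retrieval: auxiliary codes plus the top-k BM25 hits.

    Taking the union of both sources is an assumption of this sketch.
    """
    codes = list(code_descriptions)
    scores = bm25_scores(note_tokens, [code_descriptions[c] for c in codes])
    ranked = sorted(zip(codes, scores), key=lambda cs: cs[1], reverse=True)
    bm25_codes = [c for c, s in ranked[:top_k] if s > 0]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(list(aux_codes) + bm25_codes))


# Toy data: shortened ICD-9 descriptions, tokenized by hand.
code_descriptions = {
    "428.0": ["congestive", "heart", "failure"],
    "401.9": ["essential", "hypertension"],
    "250.00": ["diabetes", "mellitus"],
}
note = "patient with congestive heart failure and hypertension".split()
# Pretend a medication record pointed to diabetes (hypothetical mapping).
candidates = retrieve_candidates(note, code_descriptions, aux_codes=["250.00"])
print(candidates)
```

In this toy case the note never mentions diabetes, so BM25 alone would miss `250.00`; the auxiliary source supplies it, which is the motivation for a hybrid candidate set.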

Paper Structure (20 sections, 14 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: An example of a medical record from the MIMIC-III dataset which includes the discharge summary, assigned ICD codes and auxiliary knowledge. We colour each code and its corresponding mentions in the discharge summary and auxiliary knowledge. We use the auxiliary knowledge of the notes to retrieve the candidate subset of the label space.
  • Figure 2: Overview of the proposed multi-stage retrieve and re-rank framework. The model first leverages auxiliary knowledge and BM25 to retrieve a candidate list from the full label space, then uses a re-rank model that leverages the code co-occurrence guided contrastive learning to generate the final relevant labels.
  • Figure 3: (a) ICD code distribution. (b) Macro-AUC comparison of our model and CAML on ICD codes of different frequencies. (c) Micro-F1 comparison of our model and CAML on ICD codes of different frequencies.
  • Figure 4: Case study on the effectiveness of incorporating label co-occurrence. Correctly predicted labels are marked in green and the incorrect ones are marked in red.
  • Figure 5: Case study on the effectiveness of incorporating auxiliary knowledge. Correctly predicted labels are marked in green and the incorrect ones are marked in red.
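
The re-ranking stage in Figure 2 pulls a note towards its positive codes within the retrieved candidate list. The sketch below shows one common way to express that idea, an InfoNCE-style loss averaged over the positive codes. The exact objective used in the paper is not given on this page, so the loss form, the temperature `tau`, and the function names are assumptions. The embeddings are taken as given; in the paper the label side comes from a Graphormer-based encoder over the code co-occurrence graph, which is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(note_vec, label_vecs, positives, tau=0.1):
    """InfoNCE-style loss over the candidate list (assumed form, not the paper's).

    `positives` holds the indices of the true codes among `label_vecs`.
    """
    sims = [cosine(note_vec, lv) / tau for lv in label_vecs]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(sims)
    log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
    return -sum(sims[i] - log_denom for i in positives) / len(positives)


def rerank(note_vec, label_vecs):
    """Return candidate indices sorted by similarity to the note, best first."""
    return sorted(
        range(len(label_vecs)),
        key=lambda i: cosine(note_vec, label_vecs[i]),
        reverse=True,
    )


# Toy 2-d embeddings: candidates 0 and 2 lie close to the note, candidate 1 does not.
note_vec = [1.0, 0.0]
label_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]

aligned = contrastive_loss(note_vec, label_vecs, positives=[0, 2])
misaligned = contrastive_loss(note_vec, label_vecs, positives=[1])
print(rerank(note_vec, label_vecs), aligned < misaligned)
```

The loss is small when the true codes are already the closest candidates and large otherwise, so minimizing it moves notes and their positive codes together, after which ranking by similarity yields the final label order.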