Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation
Xindi Wang, Robert E. Mercer, Frank Rudzicz
TL;DR
This work addresses automated ICD coding, where the label space is very large and long-tailed and the clinical notes are lengthy. It introduces a multi-stage retrieve-and-re-rank framework. The retrieval stage combines auxiliary EHR knowledge (DRG codes, CPT codes, medications) with BM25 to produce a compact candidate set. The re-ranking stage then uses a Graphormer-based label encoder over a code co-occurrence graph, together with a contrastive learning objective, to align notes with their true codes. The approach achieves state-of-the-art performance on the MIMIC-III full and top-50 settings, particularly on precision-at-k metrics, and demonstrates the importance of leveraging long context and label interdependencies. By narrowing the label space and exploiting external knowledge and code co-occurrence, the method offers a practical approach to automated medical coding, with possible future extensions that incorporate UMLS and code synonyms.
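As a rough illustration of the retrieval stage, the sketch below scores ICD code descriptions against a note with a hand-written Okapi BM25 and merges the top hits with codes suggested by auxiliary EHR knowledge. Everything here is an assumption made for illustration: the function names, the toy code descriptions, the whitespace tokenizer, the `k1` and `b` defaults, and the use of a simple set union to combine the two sources. The paper's actual combination rule is not given in this summary.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_candidates(note, code_descriptions, aux_codes, top_k=2):
    """Union of auxiliary-knowledge codes and the BM25 top-k (assumed rule)."""
    codes = list(code_descriptions)
    docs = [code_descriptions[c].lower().split() for c in codes]
    scores = bm25_scores(note.lower().split(), docs)
    ranked = sorted(zip(codes, scores), key=lambda cs: cs[1], reverse=True)
    # Keep only codes that share at least one term with the note.
    bm25_top = [c for c, s in ranked[:top_k] if s > 0]
    return sorted(set(aux_codes) | set(bm25_top))

# Toy data, invented for illustration only.
code_descriptions = {
    "428.0": "congestive heart failure unspecified",
    "401.9": "unspecified essential hypertension",
    "250.00": "diabetes mellitus without complication",
    "584.9": "acute kidney failure unspecified",
}
note = "patient admitted with congestive heart failure and hypertension"
# Hypothetical codes linked to this note's DRG, CPT, or medication entries.
aux_codes = {"584.9"}

candidates = retrieve_candidates(note, code_descriptions, aux_codes)
print(candidates)
```

The point of the sketch is the shape of the output: a short candidate list that a later re-ranking model can score, instead of the full ICD label set.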
Abstract
The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches struggle to select the proper label subset from an extremely large ICD collection with a heavily long-tailed label distribution. In this paper, we propose a multi-stage "retrieve and re-rank" framework as a novel solution to ICD indexing: a hybrid discrete retrieval method first collects candidate labels, and a contrastive learning model then re-ranks them, which allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge from the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling the clinical notes together with their positive ICD codes. Experimental results show that the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.
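To make the idea of pulling a note toward its positive codes concrete, here is a minimal sketch of a contrastive objective over a retrieved candidate set, followed by re-ranking by cosine similarity. It assumes an InfoNCE-style loss with a temperature of 0.1, which is a common choice rather than the form confirmed by this abstract. The embeddings are hand-picked toy vectors, and the note encoder and the label co-occurrence encoder are omitted entirely.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(note_vec, code_vecs, positives, temperature=0.1):
    """InfoNCE-style loss: mean negative log-softmax over the positive codes."""
    note = normalize(note_vec)
    sims = {c: dot(note, normalize(v)) / temperature
            for c, v in code_vecs.items()}
    # Log-sum-exp over all candidates, stabilized by subtracting the max.
    m = max(sims.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in sims.values()))
    return sum(log_z - sims[c] for c in positives) / len(positives)

def rerank(note_vec, code_vecs):
    """Order candidate codes by cosine similarity to the note, best first."""
    note = normalize(note_vec)
    return sorted(code_vecs,
                  key=lambda c: dot(note, normalize(code_vecs[c])),
                  reverse=True)

# Toy 2-d embeddings, invented for illustration only.
note_vec = [1.0, 0.2]
code_vecs = {
    "428.0": [0.9, 0.1],
    "401.9": [0.8, 0.4],
    "584.9": [-0.5, 1.0],
}

aligned = contrastive_loss(note_vec, code_vecs, {"428.0", "401.9"})
misaligned = contrastive_loss(note_vec, code_vecs, {"584.9"})
ranking = rerank(note_vec, code_vecs)
print(aligned < misaligned, ranking)
```

The loss is small when the true codes already sit close to the note and large when they do not, so minimizing it during training moves positive codes up the re-ranked list.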
