Large language models are good medical coders, if provided with tools
Keith Kwan
TL;DR
The paper tackles automated ICD-10-CM coding and shows that prompting an LLM directly for a code performs far worse than a retrieval-augmented approach. It introduces a two-stage Retrieve-Rank system that uses ColBERTv2 to retrieve candidate codes and GPT-3.5-turbo to rerank them, reaching 100% accuracy on a dataset of 100 single-term medical conditions versus 6% for the vanilla GPT-3.5-turbo baseline. The gap points to the promise of retrieval-augmented methods for improving the accuracy and efficiency of medical coding, though the results still need validation on larger and more realistic datasets. The work contributes to the shift toward retrieval-based medical NLP systems and highlights avenues for interpretability and broader deployment in healthcare administration.
Abstract
This study presents a novel two-stage Retrieve-Rank system for automated ICD-10-CM medical coding and compares its performance against a Vanilla Large Language Model (LLM) approach. Evaluated on a dataset of 100 single-term medical conditions, the Retrieve-Rank system achieved 100% accuracy in predicting the correct ICD-10-CM codes, far outperforming the Vanilla LLM (GPT-3.5-turbo), which achieved only 6% accuracy. Our analysis shows that the Retrieve-Rank system assigns codes more accurately across medical terms from different specialties. While these results are promising, we acknowledge the limitations of using simplified inputs and the need for further testing on more complex, realistic medical cases. This research contributes to the ongoing effort to improve the efficiency and accuracy of medical coding, highlighting the importance of retrieval-based approaches.
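The two-stage idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the four-entry code table is a toy stand-in for the full ICD-10-CM code set, the token-overlap retriever is a simple placeholder for the retrieval stage (the TL;DR names ColBERTv2), and the `choose` callable is a hypothetical stand-in for the LLM reranking call (GPT-3.5-turbo in the paper). All function names here are assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

# Toy code table for illustration only; a real system would index
# the full ICD-10-CM code set.
CODE_DESCRIPTIONS = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
    "K21.9": "Gastro-esophageal reflux disease without esophagitis",
}

# Hypothetical reranker interface: (query, candidate codes) -> chosen code.
Chooser = Callable[[str, List[str]], str]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 3) -> List[str]:
    """Stage 1: return the k codes whose descriptions best match the query.

    Jaccard token overlap is used here as a placeholder for a learned
    retriever; it is NOT how ColBERTv2 scores documents.
    """
    q = tokenize(query)
    scored = []
    for code, description in CODE_DESCRIPTIONS.items():
        d = tokenize(description)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, code))
    # Highest score first; ties broken alphabetically by code.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:k]]


def rerank(query: str, candidates: List[str], choose: Chooser) -> str:
    """Stage 2: let the chooser pick one code from the candidates."""
    picked = choose(query, candidates)
    # Fall back to the top retrieved code if the chooser returns
    # something outside the candidate list.
    return picked if picked in candidates else candidates[0]


def predict(query: str, choose: Chooser, k: int = 3) -> str:
    """Run retrieval followed by reranking for a single term."""
    return rerank(query, retrieve(query, k), choose)


def accuracy(examples: List[Tuple[str, str]], choose: Chooser) -> float:
    """Fraction of (term, gold code) pairs predicted exactly."""
    correct = sum(predict(term, choose) == gold for term, gold in examples)
    return correct / len(examples)


def first_candidate(query: str, candidates: List[str]) -> str:
    """Trivial chooser that trusts the retriever's top result."""
    return candidates[0]
```

Calling `accuracy(pairs, first_candidate)` on a list of `(term, gold_code)` pairs gives the exact-match metric that the abstract reports. In a real system, `first_candidate` would be replaced by a function that sends the term and the candidate descriptions to an LLM. Restricting the final answer to retrieved candidates means the model can only select codes that actually exist in the table.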
