Code Like Humans: A Multi-Agent Solution for Medical Coding

Andreas Motzfeldt, Joakim Edin, Casper L. Christensen, Christian Hardmeier, Lars Maaløe, Anna Rogers

TL;DR

Medical coding maps unstructured clinical notes to ICD-10-CM codes, a labor-intensive task with significant implications for patient care and revenue. The authors introduce Code Like Humans (CLH), a multi-agent LLM-based framework that uses external ICD resources (the alphabetical index, the code hierarchy, and the coding guidelines) to emulate human coders and support open-set coding across the full 70K-code ICD-10-CM space. CLH achieves competitive macro-F1 against state-of-the-art discriminative models on rare codes, and the paper analyzes the system's strengths and blind spots, highlighting the practicality of human-in-the-loop deployment. The work argues for assistive tooling rather than full automation in clinical coding and outlines future directions in data resources, retrieval strategies, and component-level fine-tuning to improve real-world applicability.

Abstract

In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD-10 coding system (70K+ labels). It achieves the best performance to date on rare diagnosis codes (fine-tuned discriminative classifiers retain an advantage for high-frequency codes, to which they are limited). Towards future work, we also contribute an analysis of system performance and identify its 'blind spots' (codes that are systematically undercoded).

Paper Structure

This paper contains 49 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of Code-Like-Humans, our agentic framework whose structure mirrors the Analyze-Locate-Assign-Verify approach of the UK National Health Service. The four agents sequentially emulate how medical coders extract evidence, navigate the alphabetical index, validate the ICD hierarchy, and reconcile coding conventions to 'translate' clinical notes into ICD codes.
  • Figure 2: Chapter-level comparison of retrieval recall@25. The x-axis reports retrieval of alphabetical index terms with the evidence extractor (step 1), while the y-axis reports retrieval with expert-annotated evidence.
  • Figure 3: F1 micro scores for the tabular validator (step 3) and code reconciler (step 4) as the number of negative codes increases. The code reconciler (step 4) remains more robust in larger candidate sets due to its simpler prediction task.
  • Figure 4: F1 micro scores for the tabular validator (step 3) with added context. Guideline information yields the strongest gains as candidate codes increase.
  • Figure 5: F1 micro scores for the tabular validator (step 3) with and without reasoning. Reasoning is consistently superior, especially with more negative codes and longer contexts.
  • ...and 3 more figures
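
The figure captions above report two metrics, recall@25 and micro-averaged F1. The following is a minimal sketch of how these metrics are commonly computed over sets of codes; it is not taken from the paper, and the function names `recall_at_k` and `micro_f1`, as well as the example codes, are illustrative assumptions. The paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int = 25) -> float:
    """Fraction of gold items found among the top-k ranked candidates.

    Hypothetical helper: `ranked` is a retrieval result ordered best-first,
    `gold` is the set of reference items for the same note.
    """
    if not gold:
        return 0.0  # convention chosen here; undefined when there is no gold item
    top_k = set(ranked[:k])
    return len(top_k & gold) / len(gold)


def micro_f1(predicted: Sequence[Set[str]], gold: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all documents, then compute a single F1 score."""
    tp = fp = fn = 0
    for pred_codes, gold_codes in zip(predicted, gold):
        tp += len(pred_codes & gold_codes)   # codes correctly predicted
        fp += len(pred_codes - gold_codes)   # predicted but not in reference
        fn += len(gold_codes - pred_codes)   # in reference but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative data only (not from the paper).
ranked_candidates = ["E11.9", "I10", "J45.909"]
gold_codes = {"I10", "N17.9"}
r_at_2 = recall_at_k(ranked_candidates, gold_codes, k=2)

preds = [{"I10", "E11.9"}, {"J45.909"}]
refs = [{"I10"}, {"J45.909", "N17.9"}]
f1 = micro_f1(preds, refs)
```

Because micro-averaging pools counts across all documents, frequent codes dominate the score, whereas macro-F1 (mentioned in the TL;DR) averages per-code scores and so weights rare codes equally.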