Can Large Language Models abstract Medical Coded Language?

Simon A. Lee, Timothy Lindsey

TL;DR

This paper investigates whether large language models can learn and generate names from structured medical codes (ICD-9/10, CPT, LOINC, NDC), a domain that is not natural language and is frequently updated. It evaluates a broad set of general and biomedical LLMs on two tasks (identifying code chapters and predicting specific codes), plus an adversarial prompt to probe hallucinations, using both open-source and API-based models restricted from internet access. The results show that LLMs generally struggle to generate correct code names: performance improves only for the largest models (GPT-4) and mainly for common codes, rare codes remain challenging, and adversarial prompts reveal persistent hallucinations. To address these gaps, the authors discuss potential approaches such as knowledge graphs with retrieval augmentation, chain-of-thought reasoning, and synthetic/text-engineered data to improve the learning of coded representations. The findings underscore important limitations for deploying LLMs in automated coding and billing tasks in healthcare and outline a roadmap for future work on ontologies and representations in medical coding systems.
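
To make the two tasks concrete, the sketch below scores chapter identification and code-name prediction by normalized exact match. It is an illustration only: the `REFERENCE` table, the helper names, and the exact-match criterion are assumptions made here, not the paper's test set or metric. The three entries are real ICD-10-CM codes with their chapter ranges and descriptions.

```python
import re

# Hypothetical reference table (assumption, not the paper's data):
# code -> (ICD-10-CM chapter range, official description).
REFERENCE = {
    "E11.9": ("E00-E89", "Type 2 diabetes mellitus without complications"),
    "I10": ("I00-I99", "Essential (primary) hypertension"),
    "J45.909": ("J00-J99", "Unspecified asthma, uncomplicated"),
}


def normalize(text):
    """Lowercase and collapse punctuation/whitespace before comparing."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def chapter_correct(code, predicted_range):
    """Task 1 (assumed scoring): does the predicted chapter range match?"""
    return REFERENCE[code][0] == predicted_range.strip().upper()


def name_correct(code, predicted_name):
    """Task 2 (assumed scoring): does the generated name match the reference?"""
    return normalize(REFERENCE[code][1]) == normalize(predicted_name)


def accuracy(predictions, scorer):
    """Fraction of codes for which the scorer accepts the model output."""
    hits = sum(scorer(code, pred) for code, pred in predictions.items())
    return hits / len(predictions)
```

Exact match is deliberately strict: a generated name that is clinically close but worded differently counts as an error, so a real evaluation might prefer a softer similarity measure.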

Abstract

Large Language Models (LLMs) have become a pivotal research area, with potential benefits in fields like healthcare, where they could streamline automated billing and decision support. However, healthcare relies heavily on specialized coded languages such as ICD-10, which are regularly updated and deviate from natural language formats, and this may make it difficult for LLMs to form accurate and meaningful latent representations. This raises concerns among healthcare professionals about inaccuracies or "hallucinations" that could directly affect patients. This study therefore evaluates whether LLMs are aware of medical code ontologies and can accurately generate names from these codes. We assess the capabilities and limitations of both general and biomedical-specific generative models, such as GPT, LLaMA-2, and Meditron, focusing on their proficiency with domain-specific terminologies. While the results indicate that LLMs struggle with coded language, we offer insights on how to adapt these models to reason more effectively.

Paper Structure

This paper contains 30 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Each digit in a medical code corresponds to an encoded structural meaning.
  • Figure 2: An overview of the proposed experiments to evaluate whether the LLMs can properly predict (understand) the corresponding medical codes.
  • Figure 3: Frequencies of ICD codes in the publicly available MIMIC-IV database, plotted by occurrence to differentiate common from uncommon codes.
  • Figure 4: Warning messages shown by the desktop version of Gemini.
  • Figure 5: Frequencies of ICD codes in the MIMIC-IV database, plotted by count to differentiate abundant codes from non-abundant ones.
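
The frequency-based split behind Figures 3 and 5 can be sketched as a simple count over observed codes. This is a minimal illustration under stated assumptions: the function name and the `min_count` cutoff are hypothetical, and the paper's actual threshold is not given in this summary.

```python
from collections import Counter


def split_by_frequency(observed_codes, min_count):
    """Partition codes into common and uncommon sets by occurrence count.

    `min_count` is an assumed cutoff chosen for illustration only.
    """
    counts = Counter(observed_codes)
    common = {code for code, n in counts.items() if n >= min_count}
    uncommon = set(counts) - common
    return common, uncommon
```

Given a list with one entry per recorded diagnosis, `split_by_frequency(codes, min_count=3)` returns the set of codes seen at least three times and the set seen fewer times.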