Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Egbert van der Haring, Kees van Boven, Marcelo Finger, Luis Fernandez Lopez

TL;DR

This paper benchmarks 33 large language models (LLMs) on automated ICPC-2 code selection, pairing Brazilian Portuguese clinical queries with a domain-specific semantic search engine. Framing the task as extract-retrieve-select, the study finds that 28 of the 33 models reach F1-scores above 0.8, with gpt-4.5-preview, o3, and gemini-2.5-pro among the top performers, and that optimizing the retriever can add up to ~4 percentage points. It also highlights practical challenges such as formatting compliance, hallucination risk, and cost, and reports that performance generally improves with model size up to ~30B parameters, after which returns diminish. The work provides a baseline benchmark and calls for broader multilingual, end-to-end clinical validation, improved prompting strategies, and possible refinement of the data annotations before LLM-assisted coding can be considered clinically viable.
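The retrieve and select steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-entry index, the 3-dimensional vectors, the function names, and the simplified code pattern (one uppercase letter followed by two digits) are all hypothetical. In the study, embeddings come from a real embedding model and the selection is made by an LLM; here the LLM's answer is just a string passed to a validator.

```python
import math
import re

# Hypothetical toy index: (ICPC-2 code, Portuguese label, made-up embedding).
INDEX = [
    ("R05", "tosse", [0.9, 0.1, 0.0]),
    ("A03", "febre", [0.1, 0.9, 0.1]),
    ("N01", "cefaleia", [0.0, 0.2, 0.9]),
]

# Assumed simplified format check: one uppercase letter plus two digits.
CODE_PATTERN = re.compile(r"[A-Z]\d{2}")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k (code, label) pairs most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(code, label) for code, label, _ in ranked[:k]]


def validate_selection(raw_answer, candidates):
    """Accept a model answer only if it is well formed and among the candidates."""
    answer = raw_answer.strip().upper()
    if not CODE_PATTERN.fullmatch(answer):
        return None  # format violation
    allowed = {code for code, _ in candidates}
    # A well-formed code outside the candidate list is treated as a possible hallucination.
    return answer if answer in allowed else None


candidates = retrieve([1.0, 0.2, 0.0], INDEX, k=2)
print(candidates)
print(validate_selection(" r05 ", candidates))
```

Rejecting answers that fall outside the retrieved candidates is one simple way to separate format errors from hallucinated codes; whether the benchmark counts them this way is not stated in the summary above.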

Abstract

Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine.

Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.

Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length.

Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
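The abstract reports F1-scores but does not say how they are aggregated across queries. The sketch below shows one plausible reading, a micro-averaged F1 over sets of codes per query; the function name and the example codes are illustrative assumptions, not data from the paper.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-query sets of gold and predicted codes."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        tp += len(gold_codes & pred_codes)  # codes correctly selected
        fp += len(pred_codes - gold_codes)  # selected but not annotated
        fn += len(gold_codes - pred_codes)  # annotated but not selected
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations and model selections for three queries.
gold = [{"R05"}, {"A03"}, {"N01", "A03"}]
predicted = [{"R05"}, {"A04"}, {"N01"}]
print(round(micro_f1(gold, predicted), 4))
```

In this example there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.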

Paper Structure

This paper contains 27 sections, 1 equation, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Relationship between mean price in USD per 1,000 responses and F1-score. For each model, the maximum F1-score was considered. Locally tested models were not included. Note that the x-axis is in log scale.
  • Figure 2: Relationship between mean token usage per response and F1-score. For each model, the max F1-score was considered.
  • Figure A.1: Frequency map of ICPC-2 codes in the evaluation dataset (except process codes). A blank square means that the corresponding code is not represented in the evaluation dataset. A gray square means that the corresponding code does not exist in ICPC-2.
  • Figure A.2: Frequency map of process ICPC-2 codes in the evaluation dataset. A blank square means that the corresponding code is not represented in the evaluation dataset. A gray square means that the corresponding code does not exist in ICPC-2.
  • Figure A.3: Relative improvement in F1-score only considering cases in which there was a relevant code among the search results. For each model, the max F1-score was considered.
  • ...and 5 more figures