A Retrieval-Based Approach to Medical Procedure Matching in Romanian

Andrei Niculae, Adrian Cosma, Emilian Radoi

TL;DR

This work tackles the challenge of aligning Romanian medical procedure names with insurance-standardized codes by reframing the task as retrieval rather than multiclass classification. It evaluates dense and sparse embeddings, including a fine-tuned mE5 model, within a Milvus vector store, and demonstrates that dense, metric-learning-tuned representations significantly outperform BM25 and hybrid approaches. The best system achieves Acc@1 of 85.2% in a setup combining masterlist entries with clinic mappings, and a doctor-validated Acc@1 of 94.7% with a 1200x speedup over manual mapping, underscoring its practical impact on private healthcare reimbursement workflows in Romania. The results advance medical NLP for low-resource languages and suggest that robust, scalable retrieval-based matching can be extended to similar settings with limited language-specific medical resources.

Abstract

Accurately mapping medical procedure names from healthcare providers to standardized terminology used by insurance companies is a crucial yet complex task. Inconsistencies in naming conventions lead to misclassified procedures, causing administrative inefficiencies and insurance claim problems in private healthcare settings. Many companies still rely on staff to perform this mapping manually, although there is a clear opportunity for automation. This paper proposes a retrieval-based architecture leveraging sentence embeddings for medical name matching in the Romanian healthcare system. This challenge is significantly more difficult in underrepresented languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. We evaluate multiple embedding models, including Romanian, multilingual, and medical-domain-specific representations, to identify the most effective solution for this task. Our findings contribute to the broader field of medical NLP for low-resource languages such as Romanian.

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Diagram of the medical procedure matching problem. Clinics often have their own local names for medical procedures, which change annually and which a central insurance agency must match to a standardized list of procedures for reimbursement.
  • Figure 2: Overall diagram of our method. We formulate medical procedure matching as a retrieval problem: entries in the masterlist are embedded and stored in a vector store and the most similar entry is retrieved based on the similarity with a procedure name from a clinic.
  • Figure 3: Distribution of the number of unique clinic descriptions per masterlist procedure. There is a severe data imbalance: 19,493 (~50%) out of 39,097 entries contain only a single example.
  • Figure 4: Fine-tuning approach for dense sentence embeddings. A pretrained text embedding model is trained to minimize the distance between representations of masterlist entries and their associated clinic procedure names while maximizing the distance to every other entry.
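
The retrieval formulation described in Figure 2, together with the Acc@k metric reported in the TL;DR, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the character n-gram `embed` function is a toy stand-in for a dense sentence embedding model (such as the fine-tuned mE5 model), the in-memory `VectorStore` class stands in for Milvus, and the procedure codes and names are hypothetical rather than taken from the paper's data.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence encoder: L2-normalised character n-gram counts."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors (a dot product)."""
    return sum(v * b.get(gram, 0.0) for gram, v in a.items())


class VectorStore:
    """Hypothetical in-memory store; a real system would use a vector database."""

    def __init__(self):
        self.entries = []  # list of (procedure_code, embedding)

    def add(self, code, name):
        self.entries.append((code, embed(name)))

    def search(self, query, k=1):
        """Return the codes of the k entries most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [code for code, _ in ranked[:k]]


def accuracy_at_k(store, labelled_queries, k=1):
    """Fraction of queries whose gold code appears among the top-k results."""
    hits = sum(gold in store.search(query, k) for query, gold in labelled_queries)
    return hits / len(labelled_queries)


# Hypothetical masterlist entries (code, standardized name).
store = VectorStore()
store.add("P001", "Ecografie abdominala")
store.add("P002", "Radiografie toracica")
store.add("P003", "Consultatie cardiologie")

# Hypothetical clinic-side names paired with their gold masterlist codes.
queries = [
    ("Eco abdomen", "P001"),
    ("Radiografie torace", "P002"),
    ("Consult cardiologic", "P003"),
]

print("Top match:", store.search("Radiografie torace", k=1))
print("Acc@1:", accuracy_at_k(store, queries, k=1))
```

Framing the task as retrieval rather than classification means that new masterlist entries can be added with `store.add` without retraining a classifier head, which matters when roughly half of the entries have only a single example (Figure 3).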