Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case

Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, Antonino Nocera

TL;DR

The paper assesses whether zero-shot, open-source multilingual LLMs can extract comorbidities from Italian EHR text in an on-premises healthcare setting. It compares LLM output against regex baselines and clinician-annotated ground truth, using 8,223 Italian anamnesis records and five target comorbidities. Across the six open-source LLMs evaluated, results show limited generalization and insufficient accuracy relative to regex or human annotation, raising trust and safety concerns for deploying LLMs in healthcare. The study underscores the continued value of pattern matching for information retrieval in clinical texts and points to in-context learning and model fine-tuning as future directions for improving performance.

Abstract

Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors such as healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this task, they often fall short because of the complexity and variability of clinical language and the dense semantics of free clinical text. Recently, LLMs have become a powerful tool for understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand Electronic Health Records (EHRs) in Italian and to help extract information from them in real time. Our detailed experimental campaign on comorbidity extraction from EHRs reveals that some LLMs struggle in zero-shot, on-premises settings, while others show significant variation in performance and fail to generalize across diseases when compared with native pattern matching and manual annotations.
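
To make the pattern-matching baseline and the accuracy comparison concrete, the following is a minimal Python sketch. It is not the authors' implementation: the regular expressions, the label keys, and the helper names `flag_comorbidities` and `per_label_accuracy` are assumptions chosen for illustration, and the patterns ignore negation, abbreviations, and spelling variants that a production baseline would need to handle.

```python
import re

# Hypothetical patterns for the five target comorbidities.
# These are illustrative assumptions, not the expressions used in the paper.
PATTERNS = {
    "fibrillazione_atriale": re.compile(r"\bfibrillazione\s+atriale\b", re.IGNORECASE),
    "insufficienza_renale": re.compile(r"\binsufficienza\s+renale\b", re.IGNORECASE),
    "bpco": re.compile(
        r"\bbpco\b|\bbroncopneumopatia\s+cronica\s+ostruttiva\b", re.IGNORECASE
    ),
    "diabete_mellito": re.compile(r"\bdiabete(\s+mellito)?\b", re.IGNORECASE),
    "ipertensione_arteriosa": re.compile(r"\bipertensione\s+arteriosa\b", re.IGNORECASE),
}


def flag_comorbidities(text):
    """Return one boolean per comorbidity: True if its pattern occurs in the text."""
    return {label: bool(pattern.search(text)) for label, pattern in PATTERNS.items()}


def per_label_accuracy(predictions, ground_truth):
    """Fraction of records on which the prediction agrees with the reference label.

    Both arguments are lists of dicts keyed by the labels in PATTERNS.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one record is required")
    return {
        label: sum(p[label] == g[label] for p, g in zip(predictions, ground_truth))
        / len(predictions)
        for label in PATTERNS
    }


if __name__ == "__main__":
    # Invented example note, not taken from the study data.
    note = "Paziente con BPCO e ipertensione arteriosa, diabete mellito tipo 2."
    print(flag_comorbidities(note))
```

The same `per_label_accuracy` helper could score any extractor that returns one boolean per comorbidity, so LLM output parsed into that shape can be compared against either the regex flags or the manual annotations.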

Paper Structure

This paper contains 12 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Methodology. Data gathering pipeline, data annotation, and comparison: from regex-based automatic classification and clinician-validated ground truth to LLM extraction.
  • Figure 2: Classification using regular expressions for the chosen comorbidities: a) Fibrillazione atriale, b) Insufficienza renale, c) BPCO (Broncopneumopatia cronica ostruttiva), d) Diabete mellito, and e) Ipertensione arteriosa.
  • Figure 3: LLM accuracy compared to regular expressions: a) OpenLLaMA 3B, b) OpenLLaMA 7B, c) Mistral 7B, d) Mixtral 8x7B, e) Qwen2.5 3B, and f) Qwen2.5 7B.
  • Figure 4: Overall accuracy across different models compared to a) regular expression annotation and b) manual annotation.
  • Figure 5: Manual annotation classification of the chosen comorbidities: a) Fibrillazione atriale, b) Insufficienza renale, c) BPCO (Broncopneumopatia cronica ostruttiva), d) Diabete mellito, and e) Ipertensione arteriosa.
  • ...and 2 more figures