Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei, Atena Farangi, AmirBahador Boroumand

TL;DR

This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models to establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources.

Abstract

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B--8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
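The four metrics named in the abstract can all be derived from the confusion counts of a single binary feature. The following is a minimal sketch of that arithmetic in plain Python, not the authors' evaluation code. The function names, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Divide, returning 0.0 for an empty denominator (an assumed convention)."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, macro-F1, and MCC for one binary feature."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = safe_div(tp, tp + fn)  # recall on the positive class
    specificity = safe_div(tn, tn + fp)  # recall on the negative class
    # Macro-F1 averages the F1 of the positive and the negative class equally,
    # so the minority class counts as much as the majority class.
    f1_pos = safe_div(2 * tp, 2 * tp + fp + fn)
    f1_neg = safe_div(2 * tn, 2 * tn + fn + fp)
    macro_f1 = (f1_pos + f1_neg) / 2
    # MCC uses all four cells of the confusion matrix.
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "macro_f1": macro_f1,
        "mcc": mcc,
    }


# Toy example with an imbalanced feature (3 positives, 5 negatives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Running `binary_metrics` once per feature and taking the median of each metric across features would give per-model summaries of the kind reported above.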

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Schematic overview of the study. The upper panel shows dataset preprocessing, inference, and post-processing: 1,221 Persian palliative care phone-call transcripts are translated into English, prompts are constructed with input--output examples, and inference is run with multiple small language models (SLMs). The models' structured outputs are then post-processed into tabular data. The lower panel illustrates the multifaceted analysis framework, which compares the 13 manually extracted reference features with model-derived features through performance metrics (accuracy, sensitivity, specificity, precision, F1-score), assessment of translation effects, robustness analysis (Matthews correlation coefficient [MCC] and missing values), and sensitivity--specificity trade-offs.
  • Figure 2: Comparative performance of the models on validation metrics. (A) Median values of five metrics (accuracy, sensitivity, specificity, macro-averaged F1-score, and precision) across the 13 extracted features, measured against the manually extracted ground truth. (B) Matthews Correlation Coefficient (MCC) values for each evaluated model across the 13 extracted clinical features, comparing model-generated outputs with the manually extracted ground-truth annotations. (C) Total number of missing values for each model across the extracted features.
  • Figure 3: Sensitivity–specificity trade-offs across evaluated small language models. Each panel depicts sensitivity (y-axis) versus specificity (x-axis) for 13 binary clinical features (points labeled 1–13; see legend) for (A) Llama-3.1-8B-Instruct, (B) Qwen2.5-7B-Instruct, (C) Llama-3.2-3B-Instruct, (D) Qwen2.5-1.5B-Instruct, (E) Gemma-3-1B-it, (F) Aya-expanse-8B (English), and (G) Aya-expanse-8B (Persian). The dashed diagonal line denotes the locus where sensitivity equals specificity.
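The post-processing step in Figure 1 and the missing-value counts in Figure 2(C) can be illustrated with a short sketch. The source does not specify the models' output format, so this example assumes each model returns a JSON object mapping feature names to 0 or 1, possibly surrounded by extra text. The feature names, `parse_output`, and `missing_counts` are hypothetical and are not taken from the paper.

```python
import json
import re

# Hypothetical feature names; the study extracts 13 binary clinical features.
FEATURES = ["pain", "nausea", "anxiety"]


def parse_output(raw):
    """Map raw model text to {feature: 0, 1, or None}; None marks a missing value."""
    # Assumed format: a JSON object somewhere in the text, e.g. after a preamble.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    try:
        data = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        data = {}
    row = {}
    for name in FEATURES:
        value = data.get(name)
        # Accept only integer or boolean 0/1; anything else counts as missing.
        if isinstance(value, (bool, int)) and value in (0, 1):
            row[name] = int(value)
        else:
            row[name] = None
    return row


def missing_counts(rows):
    """Count, per feature, how many parsed rows have no usable value."""
    return {name: sum(1 for row in rows if row[name] is None) for name in FEATURES}


rows = [
    parse_output('Sure! {"pain": 1, "nausea": 0, "anxiety": "yes"}'),
    parse_output("no structured answer here"),
]
counts = missing_counts(rows)
```

Treating unparseable or out-of-range answers as missing, rather than as negatives, keeps output-format failures separate from classification errors.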