Low-resource Information Extraction with the European Clinical Case Corpus

Soumitra Ghosh, Begona Altuna, Saeed Farzi, Pietro Ferrazzi, Alberto Lavelli, Giulia Mezzanotte, Manuela Speranza, Bernardo Magnini

TL;DR

This work tackles data scarcity in clinical information extraction by introducing E3C-3.0, a multilingual corpus of clinical cases with annotated diseases and test-result relations, including both native texts and English-projected translations. It presents a semi-automatic extension pipeline that uses LLMs (e.g., GPT-4) for translation and annotation projection, followed by targeted human revision, enabling high-quality annotations in additional languages. Through extensive experiments with MedMT5 and Llama3-8B, the study shows that both fine-tuning on E3C-3.0 and cross-language transfer significantly boost IE performance in low-resource languages, while augmented multilingual training helps close gaps between native and projected data. The dataset and findings offer practical avenues for building robust multilingual clinical IE systems and highlight the value of combining domain-specific pretraining with multilingual transfer for healthcare NLP.

Abstract

We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .

Paper Structure

This paper contains 25 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: A clinical case example from the English E3C dataset (document EN103007).
  • Figure 2: A sample E3C annotated text (excerpt from EN103007.xml) is displayed using the WebAnno annotation tool. Annotations are highlighted: clinical entities (red), results/measurements (grey), events (blue), temporal expressions (purple), actors (turquoise), and body parts (lilac).
  • Figure 3: Procedure for the extension of an E3C annotated dataset to a target language.
  • Figure 4: Example of the stand-off XMI for clinical case EN103007.xml in Figure \ref{fig:il_en}. For brevity and readability, the figure shows a reduced version of the actual XMI.
  • Figure 5: Inline representation of the stand-off XMI from Figure 4.
  • ...and 5 more figures
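
The figure list mentions two representations of the same annotations: stand-off (Figure 4) and inline (Figure 5). As a rough illustration of how the two relate, the sketch below converts character-offset spans into inline tags and back. It is not the E3C XMI schema or the authors' projection code: the `DISEASE` label, the `<LABEL>...</LABEL>` tag syntax, the function names `to_inline` and `to_standoff`, and the example sentence are all assumptions made for this example, and overlapping or nested spans are deliberately not handled.

```python
import re
from typing import List, Tuple

# A stand-off annotation: (begin, end, label), with `end` exclusive.
Span = Tuple[int, int, str]

# Matches the hypothetical inline form <LABEL>content</LABEL>.
TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def to_inline(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span of `text` in <label>...</label> tags."""
    out, cursor = [], 0
    for begin, end, label in sorted(spans):
        if begin < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:begin])                        # untouched text
        out.append(f"<{label}>{text[begin:end]}</{label}>")   # tagged mention
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def to_standoff(inline: str) -> Tuple[str, List[Span]]:
    """Strip inline tags and recover offsets relative to the plain text."""
    plain, spans = [], []
    cursor = 0   # position in the tagged string
    offset = 0   # length of plain text emitted so far
    for match in TAG.finditer(inline):
        before = inline[cursor:match.start()]
        plain.append(before)
        offset += len(before)
        content = match.group(2)
        spans.append((offset, offset + len(content), match.group(1)))
        plain.append(content)
        offset += len(content)
        cursor = match.end()
    plain.append(inline[cursor:])
    return "".join(plain), spans


if __name__ == "__main__":
    # Made-up sentence, not taken from the corpus.
    text = "The patient had pneumonia and fever."
    start = text.index("pneumonia")
    spans = [(start, start + len("pneumonia"), "DISEASE")]
    inline = to_inline(text, spans)
    print(inline)
    # Round trip: stripping the tags should give back the text and offsets.
    assert to_standoff(inline) == (text, spans)
```

One reason an inline form can be convenient for projection is that tags travel with the text through translation, after which offsets in the target language can be recomputed with something like `to_standoff`. Whether the E3C-3.0 pipeline works this way is not stated in this section.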