Low-resource Information Extraction with the European Clinical Case Corpus

Soumitra Ghosh; Begona Altuna; Saeed Farzi; Pietro Ferrazzi; Alberto Lavelli; Giulia Mezzanotte; Manuela Speranza; Bernardo Magnini

TL;DR

This work tackles data scarcity in clinical information extraction by introducing E3C-3.0, a multilingual corpus of clinical cases with annotated diseases and test-result relations, including both native texts and English-projected translations. It presents a semi-automatic extension pipeline that uses LLMs (e.g., GPT-4) for translation and annotation projection, followed by targeted human revision, enabling high-quality annotations in additional languages. Through extensive experiments with MedMT5 and Llama3-8B, the study shows that both fine-tuning on E3C-3.0 and cross-language transfer significantly boost IE performance in low-resource languages, while augmented multilingual training helps close gaps between native and projected data. The dataset and findings offer practical avenues for building robust multilingual clinical IE systems and highlight the value of combining domain-specific pretraining with multilingual transfer for healthcare NLP.

Abstract

We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). The translated portion was built with a semi-automatic approach that combines automatic annotation projection based on Large Language Models (LLMs) with human revision. We present several experiments showing that current state-of-the-art LLMs benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning across languages is very effective and mitigates the scarcity of data. Finally, we compare performance on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .
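
To make the idea of annotation projection concrete, the following is a minimal sketch of one common way to carry span annotations from a source text to its translation: wrap each annotated span in inline markers, translate the marked text, then read the spans back from the markers. This is an illustration under stated assumptions, not the pipeline used to build E3C-3.0. The marker format (`<e0>…</e0>`), the `DISEASE` label, the helper names, and the hard-coded Italian string standing in for the output of a translation model are all hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Hypothetical marker format: <e0>...</e0>, <e1>...</e1>, ...
MARKER = re.compile(r"<e(\d+)>(.*?)</e\1>", re.DOTALL)


def insert_markers(text: str, spans: List[Span]) -> str:
    """Wrap each (non-overlapping) annotated span in numbered markers."""
    out, cursor = [], 0
    for i, (start, end, _label) in enumerate(sorted(spans)):
        out.append(text[cursor:start])
        out.append(f"<e{i}>{text[start:end]}</e{i}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def recover_spans(marked: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Strip markers from a marked text; return the clean text and its spans."""
    pieces, spans, cursor, offset = [], [], 0, 0
    for m in MARKER.finditer(marked):
        before = marked[cursor:m.start()]
        pieces.append(before)
        offset += len(before)
        inner = m.group(2)
        spans.append((offset, offset + len(inner), labels[int(m.group(1))]))
        pieces.append(inner)
        offset += len(inner)
        cursor = m.end()
    pieces.append(marked[cursor:])
    return "".join(pieces), spans


source = "The patient has pneumonia and diabetes."
source_spans: List[Span] = []
for mention in ("pneumonia", "diabetes"):
    start = source.find(mention)
    source_spans.append((start, start + len(mention), "DISEASE"))

marked = insert_markers(source, source_spans)

# Assumption: a translation model returns the text with markers preserved.
# The string below is a hand-written stand-in for that output.
translated_marked = "Il paziente ha <e0>polmonite</e0> e <e1>diabete</e1>."

labels = [label for _, _, label in sorted(source_spans)]
clean, projected = recover_spans(translated_marked, labels)

for start, end, label in projected:
    print(label, repr(clean[start:end]))
```

In practice a translation model may drop, duplicate, or misplace markers, which is one reason a human revision step, as described in the abstract, remains useful after automatic projection.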
