LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction

Ikram Belmadani, Parisa Nazari Hashemi, Thomas Sebbag, Benoit Favre, Guillaume Fortier, Solen Quiniou, Emmanuel Morin, Richard Dufour

TL;DR

The paper analyzes three complementary strategies for French biomedical NER and health-event extraction in a very low-resource, few-shot setting: (i) in-context learning with GPT-4.1 using automatically selected demonstrations and a summarized annotation guide, (ii) fine-tuning a universal NER model (GLiNER-biomed) on a synthetic corpus with post-verification by an LLM, and (iii) fine-tuning an open LLaMA model (LoRA) on the same synthetic data as a further NER pipeline. The event extraction task uses the same GPT-4.1 ICL approach with a prompt that reuses the guideline summary. GPT-4.1 achieves the best macro-F1, 61.53% for NER and 15.02% for events, underscoring the critical role of carefully crafted prompts in ultra-low-resource settings; synthetic data and post-verification offer limited improvements, and event extraction remains a challenging, multi-level task heavily dependent on the quality of entity recognition.

Abstract

This work presents our participation in the EvalLLM 2025 challenge on biomedical Named Entity Recognition (NER) and health event extraction in French (few-shot setting). For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing: (1) in-context learning (ICL) with GPT-4.1, incorporating automatic selection of 10 examples and a summary of the annotation guidelines into the prompt, (2) the universal NER system GLiNER, fine-tuned on a synthetic corpus and then verified by an LLM in post-processing, and (3) the open LLM LLaMA-3.1-8B-Instruct, fine-tuned on the same synthetic corpus. Event extraction uses the same ICL strategy with GPT-4.1, reusing the guideline summary in the prompt. Results show GPT-4.1 leads with a macro-F1 of 61.53% for NER and 15.02% for event extraction, highlighting the importance of well-crafted prompting to maximize performance in very low-resource scenarios.
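The reported scores are macro-F1 values, that is, an unweighted mean of per-label F1. As a rough illustration only, the sketch below computes macro-F1 over exact-match entity spans. The exact matching criterion used by the EvalLLM 2025 scorer is not stated here, so exact span matching is an assumption, and the function name `macro_f1` and the labels `DISEASE` and `LOCATION` are hypothetical.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over exact-match spans.

    gold, pred: lists (one item per document) of sets of
    (start, end, label) tuples. Exact matching is an assumption,
    not necessarily the challenge's official criterion.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1  # predicted span matches a gold span
            else:
                fp[span[2]] += 1  # predicted span with no gold match
        for span in gold_spans - pred_spans:
            fn[span[2]] += 1      # gold span that was missed

    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        expected = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / expected if expected else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and offsets, for illustration only.
gold = [{(0, 5, "DISEASE"), (10, 15, "LOCATION")}]
pred = [{(0, 5, "DISEASE"), (20, 25, "LOCATION")}]
score = macro_f1(gold, pred)
```

Because every label contributes equally to the mean, a rare label that is consistently missed lowers the score as much as a frequent one.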

Paper Structure

This paper contains 27 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Distribution of entities in the datasets (training and synthetic).
  • Figure 2: Pipeline for each run submitted to the EvalLLM campaign.
  • Figure 3: Per-label F1-score on the test set, compared across the three experimental configurations (Run 1, Run 2, and Run 3).