Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting

Marco Naguib, Xavier Tannier, Aurélie Névéol

TL;DR

This study investigates few-shot clinical NER in English, French, and Spanish by comparing prompt-based generative LLMs with fine-tuned masked language models (MLMs). It introduces a systematic approach, driven by leave-one-out cross-validation (LOOCV), to optimize tagging prompts for NER, and evaluates 14 datasets spanning the general and clinical domains. The results show that masked language models consistently outperform generative prompting in the clinical domain, with substantially lower environmental impact, while prompting sometimes remains competitive on general-domain tasks. The findings suggest that MLM-based fine-tuning is currently better suited to clinical NER under true few-shot constraints, with practical implications for low-resource information extraction and eco-conscious modeling. The study also highlights limitations such as data contamination and random variability in prompt configuration.

Abstract

Large language models (LLMs) have become the preferred solution for many natural language processing tasks. In low-resource environments such as specialized domains, their few-shot capabilities are expected to deliver high performance. Named Entity Recognition (NER) is a critical task in information extraction that is not covered in recent LLM benchmarks. There is a need for a better understanding of the performance of LLMs for NER in a variety of settings, including languages other than English. This study aims to evaluate generative LLMs, employed through prompt engineering, for few-shot clinical NER. We compare 13 auto-regressive models using prompting and 16 masked models using fine-tuning on 14 NER datasets covering English, French and Spanish. While prompt-based auto-regressive models achieve competitive F1 for general NER, they are outperformed within the clinical domain by lighter biLSTM-CRF taggers based on masked models. Additionally, masked models exhibit lower environmental impact compared to auto-regressive models. Findings are consistent across the three languages studied, which suggests that LLM prompting is not yet suited for NER production in the clinical domain.

Paper Structure (37 sections, 5 figures, 12 tables)


Figures (5)

  • Figure 1: Example of a tagging prompt, used in the main experiment (top) and a self-verification prompt (bottom) for detecting DISO mentions in n2c2-2019
  • Figure 2: Carbon emission (g) incurred by resolving CoNLL-2003 using three models
  • Figure 3: Performance of models on English. The general performance is the average of micro-F1 obtained on WikiNER-en and CoNLL-2003. The clinical performance is the average on E3C-en, n2c2 and NCBI-Disease. The red lines represent the skyline performance obtained with the entirety of each training dataset.
  • Figure 4: Performance of models on French. The general performance is the average of micro-F1 obtained on WikiNER-fr and QuaeroFrenchPress. The clinical performance is the average on E3C-fr, EMEA and MEDLINE. The red lines represent the skyline performance obtained with the entirety of each training dataset.
  • Figure 5: Performance of models on Spanish. The general performance is the average of micro-F1 obtained on WikiNER-es and CoNLL-2002. The clinical performance is the average on E3C-es and CWL. The red lines represent the skyline performance obtained with the entirety of each training dataset.
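
The captions for Figures 3 to 5 describe each domain score as an average of per-dataset micro-F1 values. The sketch below illustrates one way such a score could be computed. It assumes exact-match scoring over (start, end, label) entity spans, which this page does not confirm. The names `micro_f1` and `domain_score` and the dataset scores in the usage example are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# An entity mention as (start offset, end offset, label), e.g. (12, 20, "DISO").
# Exact-match scoring on these triples is an assumption of this sketch.
Span = Tuple[int, int, str]


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool counts over all documents, then compute F1 once."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)   # predicted and correct
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def domain_score(per_dataset_f1: Dict[str, float]) -> float:
    """Unweighted mean of per-dataset micro-F1, as the figure captions describe."""
    return sum(per_dataset_f1.values()) / len(per_dataset_f1)
```

For example, `domain_score({"E3C-en": 0.5, "n2c2": 0.7})` averages two made-up dataset scores into a single clinical score. Pooling counts before computing F1 means datasets with frequent entity types weigh more within a dataset, while the unweighted mean across datasets treats each corpus equally regardless of size.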