Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study

Adrien Bazoge, Emmanuel Morin, Beatrice Daille, Pierre-Antoine Gourraud

TL;DR

This study addresses the challenge of processing long French biomedical and clinical documents by comparing three strategies for adapting models to long sequences with the Longformer architecture: further pre-training an English clinical model on French biomedical texts, converting a French biomedical BERT to Longformer, and pre-training a French biomedical Longformer from scratch. Further pre-training the English clinical model can outperform the other two strategies, while BERT-based models remain the most efficient for NER. Across 16 downstream tasks, long-sequence French biomedical models improve performance on most tasks, and cross-lingual transfer (an English clinical model further trained on French biomedical data) offers a practical path when no French Longformer is available. The work highlights the importance of model architecture for long documents, notes the substantial computational cost of pre-training, and provides open-source resources for replication.

Abstract

Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks spanning both biomedical and clinical domains. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform both converting a French biomedical BERT to the Longformer architecture and pre-training a French biomedical Longformer from scratch. The results underscore that long-sequence French biomedical models improve performance across most downstream tasks regardless of sequence length, but BERT-based models remain the most efficient for named entity recognition tasks.

Paper Structure (39 sections, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Average attention weights for each word position on the test set of (a) the aHF classification dataset and (b) the DEFT-2021 dataset. The attention weights are taken from the [CLS] token used for classification, by summing the attention weights of all attention heads in the last layer of the model. They indicate which words matter most once each word has incorporated contextual information from the others through the model's 12 self-attention layers. In (a), the aHF classification task is annotated at both the document level (Yes/No) and the sequence level: at the sequence level, the specific word sequences that justify the document-level label are annotated. This enables a comparison between human annotations and the models' attention mechanisms.
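
The aggregation described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the last-layer attention has already been extracted as a dense `[head][query][key]` array and that [CLS] sits at position 0. The function names and the handling of documents of different lengths are illustrative assumptions, not the authors' code, and extracting attention from a Longformer's sparse attention pattern is not covered here.

```python
from typing import List

# Assumed layout: attention[head][query position][key position]
Attention = List[List[List[float]]]


def cls_attention(last_layer: Attention, cls_index: int = 0) -> List[float]:
    """Sum, over all heads, the attention the [CLS] query pays to each position."""
    seq_len = len(last_layer[0][cls_index])
    totals = [0.0] * seq_len
    for head in last_layer:
        for pos, weight in enumerate(head[cls_index]):
            totals[pos] += weight
    return totals


def average_by_position(per_document: List[List[float]]) -> List[float]:
    """Average per-position weights over documents.

    Assumption: a position is averaged only over the documents long
    enough to contain it.
    """
    max_len = max(len(doc) for doc in per_document)
    averages = []
    for pos in range(max_len):
        values = [doc[pos] for doc in per_document if pos < len(doc)]
        averages.append(sum(values) / len(values))
    return averages


# Toy document: two heads, three tokens ([CLS], w1, w2).
doc_a = [
    [[0.5, 0.25, 0.25], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.25, 0.5, 0.25], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
]
weights_a = cls_attention(doc_a)
test_set_average = average_by_position([weights_a, [0.25, 0.25]])
```

Plotting `test_set_average` against word position would give a curve of the kind the caption describes, which could then be compared with the positions of the human-annotated sequences.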