Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets

Yifan Peng, Shankai Yan, Zhiyong Lu

TL;DR

This paper introduces BLUE, a standardized benchmark for biomedical NLP that evaluates pre-trained language representations across biomedical and clinical tasks. BLUE comprises five tasks over ten corpora (sentence similarity, NER, relation extraction, document classification, and inference). The paper compares BERT and ELMo baselines and finds that a BERT model pre-trained on both PubMed abstracts and MIMIC-III clinical notes achieves the best results. The data, code, and pre-trained models are released to enable fair, reproducible evaluation. The findings highlight the importance of pre-training on both biomedical and clinical text, and they establish BLUE as a resource for guiding future work on biomedical language representations.

Abstract

Inspired by the success of the General Language Understanding Evaluation benchmark, we introduce the Biomedical Language Understanding Evaluation (BLUE) benchmark to facilitate research in the development of pre-training language representations in the biomedicine domain. The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts with different dataset sizes and difficulties. We also evaluate several baselines based on BERT and ELMo and find that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results. We make the datasets, pre-trained models, and code publicly available at https://github.com/ncbi-nlp/BLUE_Benchmark.

Paper Structure

This paper contains 20 sections, 1 figure, 3 tables.
