SciFive: a text-to-text transformer model for biomedical literature

Long N. Phan, James T. Anibal, Hieu Tran, Shaurya Chanana, Erol Bahadroglu, Alec Peltekian, Grégoire Altan-Bonnet

TL;DR

SciFive presents a domain-adapted text-to-text transformer (T5) pretrained on biomedical corpora and fine-tuned across NER, relation extraction, NLI, document classification, and QA. By using span-based masking, a SentencePiece vocabulary, and multitask learning, it achieves SOTA or near-SOTA performance on multiple tasks, especially QA, demonstrating the strong potential of generation-based approaches in biomedical NLP. The work highlights the benefits of long-output generation for biomedical information retrieval and points to future work in summarization and abstract generation within this domain.

Abstract

In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs. Our results support the exploration of more difficult text generation tasks and the development of new methods in this area.

Paper Structure

This paper contains 21 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: An illustration of span-based masked language modeling. For the input sentence, the tokens "IL", "-", "2", "kappa", "B", ..., "oxygen", "production" are randomly chosen for corruption; each run of consecutive chosen tokens is treated as a span and replaced by a unique sentinel mask token <M>. The output sequence then consists of the dropped-out spans, each preceded by the sentinel token that replaced it in the input, followed by a final sentinel token.
  • Figure 2: An illustration of multi-task learning in named-entity recognition tasks.
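
The span-based masking described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `span_corrupt`, the numbered sentinel format `<M0>`, `<M1>`, ..., and the toy sentence are all hypothetical, and the masked positions are passed in explicitly rather than sampled at random so that the result is reproducible.

```python
from typing import List, Sequence, Set, Tuple


def span_corrupt(tokens: Sequence[str], masked: Set[int]) -> Tuple[List[str], List[str]]:
    """Replace each run of consecutive masked positions with one sentinel.

    Returns (inputs, targets): `inputs` is the corrupted sentence, and
    `targets` lists each dropped span after its sentinel, closed by one
    final sentinel.
    """
    inputs: List[str] = []
    targets: List[str] = []
    sentinel_id = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:
                # A new span starts: emit a fresh sentinel on both sides.
                sentinel = f"<M{sentinel_id}>"  # hypothetical sentinel format
                sentinel_id += 1
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(tok)
            prev_masked = True
        else:
            inputs.append(tok)
            prev_masked = False
    # The final sentinel marks the end of the target sequence.
    targets.append(f"<M{sentinel_id}>")
    return inputs, targets


# Toy sentence (made up for illustration, not taken from the paper's figure).
tokens = ["IL", "-", "2", "gene", "expression", "requires", "oxygen", "production"]
inputs, targets = span_corrupt(tokens, masked={0, 1, 2, 6, 7})
print(inputs)   # ['<M0>', 'gene', 'expression', 'requires', '<M1>']
print(targets)  # ['<M0>', 'IL', '-', '2', '<M1>', 'oxygen', 'production', '<M2>']
```

Because whole spans collapse into a single sentinel, the corrupted input is shorter than the original sentence, and the model has to generate multi-token spans rather than fill in one token at a time.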