BioMistral-NLU: Towards More Generalizable Medical Language Understanding through Instruction Tuning

Yujuan Velvin Fu, Giridhar Kaushik Ramachandran, Namu Park, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen

TL;DR

This work proposes a unified prompting format for 7 important NLU tasks; curates MNLU-Instruct, an instruction-tuning dataset built from diverse existing open-source medical NLU corpora; and develops BioMistral-NLU, a generalizable medical NLU model, by fine-tuning BioMistral on MNLU-Instruct.

Abstract

Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora and can generalize to new tasks. However, these instruction-tuned LLMs often perform poorly on specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks, (2) curate an instruction-tuning dataset, MNLU-Instruct, from diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, by fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting across 6 important NLU tasks from two widely adopted medical NLU benchmarks: BLUE and BLURB. Our experiments show that BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction-tuning step over diverse NLU tasks enhance LLMs' generalizability across medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.

Paper Structure

This paper contains 21 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Instruction-tuning dataset (MNLU-Instruct), system development, and downstream evaluation for BioMistral-NLU.
  • Figure 2: Average zero-shot performance on the 4 relation extraction (RE) datasets, after instruction-tuning on 50k instances.
  • Figure 3: Average zero-shot performance on 6 biomedical named entity recognition (NER) datasets, when fine-tuned on different domains.