BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Hieu Tran; Zhichao Yang; Zonghai Yao; Hong Yu

TL;DR

BioInstruct introduces a domain-specific instruction-tuning approach to boost BioNLP performance by creating a large, automatically generated instruction dataset and applying parameter-efficient LoRA fine-tuning to LLaMA models. The study demonstrates consistent gains across QA, IE, and text generation tasks, with larger benefits when task categories are closely related and when diverse instruction data are used. Task-level analysis reveals that instruction tuning yields substantial improvements in medical QA and clinical text generation, while gains in information extraction are present but more task-dependent. The work positions instruction tuning as a practical path to leveraging biomedical knowledge in LLMs, offering a valuable resource for BioNLP applications and future multi-task transfer learning research.

Abstract

This study aims to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, a dataset of 25,005 instructions, to instruction-tune LLMs (LLaMA 1 and 2, 7B and 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether the category of instructions (e.g., QA, IE, and generation) affects model performance. Compared with LLMs that were not instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with, or even surpassed, other biomedical LLMs that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with observations from multi-task learning, suggesting synergies between tasks. The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.
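The seed-sampling step described in the abstract (three seed samples drawn at random from a pool of 80 human-curated instructions, then used to prompt a generator model) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_generation_prompt`, the prompt wording, and the placeholder seed pool are all assumptions, and the actual call to GPT-4 is omitted.

```python
import random


def build_generation_prompt(seed_pool, k=3, rng=None):
    """Draw k distinct seed instructions at random and format a few-shot prompt.

    Hypothetical helper: the prompt wording below is illustrative only.
    """
    rng = rng or random.Random()
    seeds = rng.sample(seed_pool, k)  # sampling without replacement
    lines = ["Here are example biomedical instructions:"]
    for i, seed in enumerate(seeds, start=1):
        lines.append(f"{i}. {seed}")
    lines.append("Write one new biomedical instruction that differs from the examples.")
    return "\n".join(lines), seeds


# Placeholder pool standing in for the 80 human-curated seed instructions.
seed_pool = [f"seed instruction {i}" for i in range(80)]

# A fixed seed makes the draw reproducible for this demonstration.
prompt, seeds = build_generation_prompt(seed_pool, rng=random.Random(0))
print(prompt)
```

Repeating this draw with fresh random seeds yields a different few-shot context for each request, which is one plausible way a fixed pool of 80 seeds could produce a varied set of generated instructions.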

Paper Structure (8 sections, 1 equation, 3 figures, 7 tables)

This paper contains 8 sections, 1 equation, 3 figures, 7 tables.

LLMs in BioNLP
Traditional Fine-Tuning vs Instruction Tuning
Instruction Tuning in BioNLP
How does instruction tuning perform on QA and NLI tasks?
How does instruction tuning perform on medication status extraction?
How does instruction tuning perform on clinical coreference resolution?
How does instruction tuning perform on Short Dialogue2Note Summarization?
How does instruction tuning perform on Doctor-Patient QA?

Figures (3)

Figure 1: Distribution of our BioInstruct dataset
Figure 2: Performance of different tasks in BioInstruct. Each scatter plot corresponds to one evaluation subtask. Each colored dot within a plot represents a different training task. The black dot represents the baseline performance of LLaMA 2 7B without BioInstruct fine-tuning. The purple dot represents the performance of LLaMA 2 7B fine-tuned on all BioInstruct tasks. We then ablate BioInstruct. Above each scatter plot, the 1st row shows the best single fine-tuning task (dark blue, green, red). The 2nd row shows the best fine-tuning task when added to the specific task A, where task A is the same as the evaluation task (light blue, green, red).
Figure 3: Performance on different evaluation tasks when LLaMA 2 7B is fine-tuned on varying numbers of instruction samples in BioInstruct.
