BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan

TL;DR

BioT5+ tackles the need for generalized biological understanding by integrating IUPAC molecule names, expanding bio-text and molecular data, and applying multi-task instruction tuning with a specialized numerical-tokenization scheme. It pre-trains on diverse, modality-aware data and then fine-tunes across multiple molecule- and protein-oriented tasks, achieving state-of-the-art or competitive results on 21 benchmarks. The approach demonstrates improved grounded reasoning between textual descriptions and biological sequences, with strong performance in molecule and protein description generation, property prediction, and interaction tasks. While promising, the work notes limitations in cross-task generalization and multi-modal expansion, and highlights ethical considerations around molecule generation and related societal impacts.

Abstract

Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and substantially improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned across a large number of experiments, covering 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 benchmark datasets in total, and achieves state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.

Paper Structure

This paper contains 49 sections, 3 figures, and 20 tables.

Figures (3)

  • Figure 1: (a) Overview of the BioT5+ framework. (b), (c) Composition of the BioT5+ downstream tasks, which are divided into two categories: (b) molecule-oriented tasks and (c) protein-oriented tasks. The names of the tasks, along with their instruction datasets and respective percentages, are annotated near each segment of the accompanying pie charts.
  • Figure 2: Performance (ROUGE-L) comparison on protein description generation tasks.
  • Figure 3: Overview of BioT5+ pre-training. The solid line refers to the masked span prediction task proposed by T5. Each consecutive span of masked tokens is replaced with a sentinel token, represented as <M1>, <M2>, and <M3>. We apply this pre-training task to molecule IUPAC + SELFIES (task #1), molecule SELFIES (task #2), protein FASTA (task #3), general text (task #4), wrapped text (task #5), and bio-text (task #6). The dashed line symbolizes the bidirectional translation between structured text descriptions and biological sequences (tasks #7 and #8).
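
The masked span prediction described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_spans`, the whitespace tokenization, and the hand-picked span positions are assumptions made for illustration, and only the sentinel notation (<M1>, <M2>, ...) follows the caption. In real pre-training the spans would be sampled at random and the tokens would come from the model's tokenizer.

```python
def mask_spans(tokens, spans):
    """Replace each (start, end) token span with one sentinel token.

    Returns (corrupted_input, target). The target lists each sentinel
    followed by the tokens it replaced. `spans` are hypothetical,
    non-overlapping, end-exclusive index pairs chosen by the caller.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans), start=1):
        sentinel = f"<M{i}>"
        corrupted.extend(tokens[cursor:start])  # keep the unmasked prefix
        corrupted.append(sentinel)              # one sentinel per masked span
        target.append(sentinel)
        target.extend(tokens[start:end])        # the tokens to be predicted
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the unmasked tail
    return corrupted, target


# Toy mixed text + SELFIES-like sequence (illustrative only).
tokens = "the molecule [C] [C] [O] is ethanol".split()
corrupted, target = mask_spans(tokens, [(1, 2), (3, 5)])
print(corrupted)  # ['the', '<M1>', '[C]', '<M2>', 'is', 'ethanol']
print(target)     # ['<M1>', 'molecule', '<M2>', '[C]', '[O]']
```

The same corruption routine applies unchanged to any of the token streams listed in the caption (IUPAC + SELFIES, SELFIES, FASTA, or plain text), since it only operates on token positions.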