BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan

TL;DR

BioT5 introduces a multi-modal pre-training framework that integrates text, molecule SELFIES, and protein FASTA with modality-specific vocabularies and a T5-style objective. By using wrapped biological text and translation tasks between bio-entities and descriptions, it addresses SMILES validity issues and underutilized contextual information. Evaluated on 15 downstream tasks, BioT5 achieves state-of-the-art or competitive results in molecule/protein property prediction, drug-target and protein-protein interactions, and cross-modal generation, demonstrating the value of combining chemical knowledge with natural language. The work provides a robust, scalable approach to cross-modal biological understanding and offers open-source resources for further research.

Abstract

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available on GitHub at https://github.com/QizhiPei/BioT5.

Paper Structure

This paper contains 42 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Representations of molecules and proteins. A molecule can be represented by its name, bio-sequence (SMILES and SELFIES), and 2D graph structure. A protein can be represented by its name, corresponding gene name, bio-sequence (FASTA), and 3D structure.
  • Figure 2: Overview of BioT5 pre-training. The solid line refers to the "T5 objective", which aims to reconstruct the original unmasked input. Each consecutive span of masked tokens is replaced with a sentinel token, depicted as <M1>, <M2>, and <M3>. We apply this objective to molecule SELFIES (task #1), protein FASTA (task #2), general text (task #3), and wrapped text (task #4). The dashed line represents the bidirectional translation between bio-sequences and structured text descriptions (tasks #5 and #6).
  • Figure 3: Tokenization case. MolT5 splits "Br" (bromine atom) into "B" (boron atom) and "r", resulting in incorrect descriptions that mention tetraborate (related to "B"). BioT5 retains the chemically meaningful group "[Br-1]" as a complete token, thereby producing the correct output.
  • Figure 4: Wrapped text matching and mapping process.
  • Figure 5: Molecule captioning cases.
  • ...and 1 more figure
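
The captions of Figures 2 and 3 describe two mechanisms that a short example may make more concrete: keeping each bracketed SELFIES group as a single token, and replacing masked spans with sentinel tokens. The following is a minimal Python sketch, not the authors' implementation. The function names, the example string, and the fixed span positions are hypothetical, and the target layout (each sentinel followed by the tokens it replaced) follows the standard T5 span-corruption format, which is assumed here rather than confirmed by the text above.

```python
import re


def tokenize_selfies(selfies):
    """Split a SELFIES string into whole bracketed tokens.

    Each '[...]' group stays intact, so '[Br-1]' is never split into 'B' and 'r'.
    (Hypothetical helper for illustration; not the BioT5 tokenizer.)
    """
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed bracketed string: {selfies!r}")
    return tokens


def corrupt_spans(tokens, spans):
    """Replace each half-open (start, end) span with a sentinel token.

    `spans` must be sorted and non-overlapping. They are fixed here for
    clarity; a real pipeline would sample them at random (assumption).
    Returns (inputs, targets) in the standard T5 span-corruption layout.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans, start=1):
        sentinel = f"<M{i}>"
        inputs.extend(tokens[cursor:start])  # keep the unmasked tokens
        inputs.append(sentinel)              # one sentinel per masked span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the tokens to be reconstructed
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


# Illustrative string only; it is not taken from the paper.
tokens = tokenize_selfies("[C][C][O][Br-1]")
inputs, targets = corrupt_spans(tokens, [(1, 2), (3, 4)])
print(inputs)
print(targets)
```

Because tokenization is bracket-based, the sentinel always replaces whole chemical groups, so a masked span can never cut a symbol such as "[Br-1]" in half.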