BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs

Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, Xuhong Wang

TL;DR

BioBridge tackles the limited generalization of Protein Language Models and the domain-knowledge gap in large language models by integrating Domain-Incremental Continual Pre-training (DICP) with a PLM-Projector-LLM cross-modal pipeline. It aligns protein sequence embeddings with natural language through a frozen ESM2 encoder, a QFormer querying transformer, and contrastive learning, followed by end-to-end fine-tuning on protein-text pairs to enable unified multi-task reasoning. Empirically, BioBridge approaches dedicated PLMs on protein benchmarks and matches or exceeds general-language capabilities on tasks like MMLU and RACE, while maintaining cross-task robustness and interpretability through its cross-modal framework. The approach demonstrates that domain-specific adaptation can be achieved without sacrificing general reasoning, offering a scalable route to accelerate protein biology tasks and drug discovery.
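
The contrastive alignment step mentioned above can be illustrated with a symmetric InfoNCE-style loss over paired protein and text embeddings. This is only a minimal sketch under stated assumptions: the summary does not give the exact loss, so the symmetric form, the `temperature` value, and the function names are all hypothetical choices, and the embeddings are assumed to be non-zero vectors of equal dimension.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(protein_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    protein_embs[i] and text_embs[i] are treated as a matched pair; every
    other combination in the batch acts as a negative.
    """
    p = [_normalize(v) for v in protein_embs]
    t = [_normalize(v) for v in text_embs]
    # Cosine similarity between every protein and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-protein direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

The loss is small when each protein embedding is closest to its own description and grows when the pairing is scrambled, which is the behaviour an alignment objective of this kind relies on.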

Abstract

Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pre-training framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Finally, end-to-end optimization is adopted to support various tasks uniformly, including protein property prediction and knowledge question-answering. BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB. It also achieves results on par with LLMs on general understanding tasks like MMLU and RACE. These results show that it combines domain-specific adaptability with general-purpose language competency.
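
One common way to train on domain and general data at the same time is to draw each training document from one of the two corpora with a fixed probability. The sketch below illustrates that idea only; the abstract does not state the mixing ratio or schedule, so `mixed_corpus_stream`, `domain_fraction`, and the example corpora are hypothetical.

```python
import random


def mixed_corpus_stream(domain_docs, general_docs, domain_fraction=0.5,
                        num_samples=8, seed=0):
    """Yield (source, document) pairs drawn from two corpora.

    domain_fraction is an assumed hyperparameter: the probability that the
    next document comes from the protein-domain corpus rather than the
    general reasoning corpus.
    """
    rng = random.Random(seed)  # local RNG so the stream is reproducible
    for _ in range(num_samples):
        if rng.random() < domain_fraction:
            yield ("domain", rng.choice(domain_docs))
        else:
            yield ("general", rng.choice(general_docs))


if __name__ == "__main__":
    # Placeholder documents, not real training data.
    protein_corpus = ["protein abstract A", "protein abstract B"]
    general_corpus = ["general reasoning text A", "general reasoning text B"]
    for source, doc in mixed_corpus_stream(protein_corpus, general_corpus):
        print(source, "|", doc)
```

Keeping general text in every stage of training is the intuition behind mitigating catastrophic forgetting: the model keeps seeing examples of the skills it is supposed to retain.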

Paper Structure (17 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: BioBridge Unlocks PLM-Level Performance for Qwen in Protein Research.
  • Figure 2: Case Study. Red text marks incorrect answers, and green text marks correct key answers.
  • Figure 3: Illustration of our model. Domain-Incremental Continual Pre-training (DICP) adapts a general language model to biomedical data via continual pre-training on unlabeled domain-specific corpora. The PLM-Projector maps protein embeddings into the language model’s space for protein-text alignment. End-to-End Fine-tuning connects protein and text tokens through joint optimization. Versatile Applications include alignment, classification, and generation tasks.
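
The projector idea in Figure 3 can be sketched as a small set of learnable query vectors that attend over per-residue protein embeddings and emit a fixed number of tokens in the language model's embedding width. This is a deliberately simplified, hypothetical stand-in (single-head attention, no key/value projections, random untrained weights), not the Q-Former architecture itself; `QueryProjector` and its dimensions are assumptions for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class QueryProjector:
    """Toy query-based projector: variable-length input, fixed-length output."""

    def __init__(self, num_queries, plm_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        # Learnable queries (random here, since nothing is trained).
        self.queries = [[rng.gauss(0, 1) for _ in range(plm_dim)]
                        for _ in range(num_queries)]
        # Linear map from the PLM width to the LLM width.
        self.out_proj = [[rng.gauss(0, 1) / math.sqrt(plm_dim)
                          for _ in range(plm_dim)]
                         for _ in range(llm_dim)]
        self.scale = 1.0 / math.sqrt(plm_dim)

    def __call__(self, residue_embs):
        tokens = []
        for q in self.queries:
            # Scaled dot-product attention of one query over all residues.
            scores = [self.scale * sum(a * b for a, b in zip(q, r))
                      for r in residue_embs]
            weights = softmax(scores)
            pooled = [sum(w * r[d] for w, r in zip(weights, residue_embs))
                      for d in range(len(q))]
            # Project the pooled vector into the LLM embedding width.
            tokens.append([sum(w * x for w, x in zip(row, pooled))
                           for row in self.out_proj])
        return tokens
```

Whatever the protein length, the output has `num_queries` tokens of width `llm_dim`, so the language model always receives a fixed-size prefix for each protein.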