Protein-Mamba: Biological Mamba Models for Protein Function Prediction

Bohao Xu, Yingzhou Lu, Yoshitaka Inoue, Namkyeong Lee, Tianfan Fu, Jintai Chen

TL;DR

The paper tackles protein function prediction under data scarcity and complexity by introducing Protein-Mamba, a two-stage model that performs self-supervised pretraining on unlabeled protein sequences using a Mamba backbone (a selective state-space model that builds on S4), followed by supervised fine-tuning on labeled downstream tasks. It demonstrates competitive performance against state-of-the-art baselines across a range of protein function datasets, highlighting the value of unlabeled data through self-supervised learning. The findings suggest that combining pretraining with task-specific fine-tuning can advance drug discovery workflows by improving predictive accuracy while reducing labeling requirements. A practical implication is faster candidate prioritization in drug development; possible future work includes integrating multi-omics data and early clinical profiling to further enhance predictive power.

Abstract

Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein-Mamba, a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction. The pre-training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine-tuning stage refines these insights using specific labeled datasets, resulting in superior prediction performance. Our extensive experiments demonstrate that Protein-Mamba achieves performance competitive with a couple of state-of-the-art methods across a range of protein function datasets. The model's ability to effectively utilize both unlabeled and labeled data highlights the potential of self-supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.
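
As a rough illustration of the data preparation behind the self-supervised stage, the sketch below tokenizes a protein sequence and builds training pairs. The abstract does not state the pretraining objective, so autoregressive next-token prediction is an assumption here, as are the special-token ids and the helper names `encode` and `next_token_pairs`, which are hypothetical and not taken from the paper.

```python
# Minimal sketch, not the authors' implementation.
# Assumption: pretraining uses next-token prediction over amino-acid tokens.

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical special-token ids (padding, begin, end of sequence).
PAD, BOS, EOS = 0, 1, 2

# Amino-acid ids start after the special tokens.
TOKEN_ID = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to integer ids, wrapped in BOS/EOS."""
    return [BOS] + [TOKEN_ID[aa] for aa in sequence.upper()] + [EOS]


def next_token_pairs(ids):
    """Build (inputs, targets) where each position predicts the next token."""
    return ids[:-1], ids[1:]


ids = encode("MKT")
inputs, targets = next_token_pairs(ids)
print(inputs, targets)
```

No labels are needed to build these pairs, which is what lets the first stage use unlabeled sequences; the fine-tuning stage would then reuse the same token ids with task-specific labels.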

Paper Structure (17 sections, 1 equation, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Structure of an amino acid. Amino acids are small organic molecules in which a central carbon atom (denoted $C_{\alpha}$, the alpha-carbon) is bonded to an amino group (-NH$_2$), a carboxyl group (-COOH), a hydrogen atom (H), and a variable component called a side chain (denoted "R"). The side chain determines the type of amino acid. There are 20 standard side chains, and hence 20 standard amino acids.