Protein-Mamba: Biological Mamba Models for Protein Function Prediction
Bohao Xu, Yingzhou Lu, Yoshitaka Inoue, Namkyeong Lee, Tianfan Fu, Jintai Chen
TL;DR
The paper addresses protein function prediction, where labeled data are scarce and the underlying biology is complex, by introducing Protein-Mamba, a two-stage model. A Mamba backbone (a selective state space model that builds on S4) is first pretrained with self-supervision on unlabeled protein sequences and then fine-tuned with supervision on labeled downstream tasks. The model performs competitively with state-of-the-art baselines across a range of protein function datasets, which highlights the value of unlabeled data. The findings suggest that combining pretraining with task-specific fine-tuning can benefit drug discovery workflows by improving predictive accuracy while reducing labeling requirements. Practical implications include faster candidate prioritization in drug development; possible future work includes integrating multi-omics data and early clinical profiling to further improve predictive power.
Abstract
Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein-Mamba, a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction. The pre-training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine-tuning stage refines these insights using specific labeled datasets, resulting in improved prediction performance. Our extensive experiments demonstrate that Protein-Mamba achieves performance competitive with a couple of state-of-the-art methods across a range of protein function datasets. The model's ability to use both unlabeled and labeled data effectively highlights the potential of self-supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.
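To make the two-stage recipe described above concrete, the following minimal sketch shows how the same tokenized sequence can serve both stages: in pre-training the sequence supplies its own targets, and in fine-tuning it is paired with a task label. The next-residue objective, the special-token ids, and every function name here are illustrative assumptions; the abstract does not specify the pre-training objective or the tokenization, and no model is trained in this sketch.

```python
# Hypothetical sketch: the names and the next-residue objective are
# assumptions for illustration, not details taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PAD, BOS, EOS, UNK = 0, 1, 2, 3         # illustrative special-token ids
VOCAB = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to token ids, wrapped in BOS/EOS."""
    return [BOS] + [VOCAB.get(aa, UNK) for aa in sequence.upper()] + [EOS]


def pretraining_pair(sequence):
    """Stage 1 (self-supervised): targets are the inputs shifted by one
    position, so the unlabeled sequence provides its own supervision."""
    ids = encode(sequence)
    return ids[:-1], ids[1:]


def finetuning_example(sequence, label):
    """Stage 2 (supervised): the same encoding paired with a task label."""
    return encode(sequence), label


inputs, targets = pretraining_pair("MKT")
tokens, y = finetuning_example("MKT", 1)
print(inputs, targets)   # [1, 14, 12, 20] [14, 12, 20, 2]
print(tokens, y)         # [1, 14, 12, 20, 2] 1
```

In a full pipeline, a sequence model such as a Mamba backbone would be trained to predict `targets` from `inputs` in stage 1, and a task-specific head would be trained on `(tokens, y)` pairs in stage 2.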
