Structure-Informed Protein Language Model
Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang
TL;DR
This work tackles the absence of explicit structural supervision in protein language models by injecting structure through remote homology detection, avoiding reliance on input structures. The authors fine-tune ESM-2 models on a SCOPe-fold remote-homology task to produce structure-informed embeddings (-S) and evaluate them via predictor-based and retriever-based methods across EC, GO, localization, and mutant datasets. Results show robust gains in enzyme and GO annotations for structure-informed models, with mixed outcomes on localization and mutant tasks depending on the structure-function relationship. The findings highlight the potential and limitations of structural distillation for improving protein function prediction and encourage further development of structure-aware PLMs.
Abstract
Protein language models are a powerful tool for learning protein representations through pre-training on vast protein sequence datasets. However, traditional protein language models lack explicit structural supervision, despite the relevance of structure to protein function. To address this issue, we integrate remote homology detection into training to distill structural information into protein language models without requiring explicit protein structures as input. We evaluate the impact of this structure-informed training on downstream protein function prediction tasks. Experimental results reveal consistent improvements in function annotation accuracy for EC number and GO term prediction. Performance on mutant datasets, however, varies with the relationship between the targeted properties and protein structures. This underscores the importance of considering that relationship when applying structure-aware training to protein function prediction tasks. Code and model weights are available at https://github.com/DeepGraphLearning/esm-s.
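To make the retriever-based evaluation mentioned in the TL;DR more concrete, the following is a minimal sketch of label transfer by embedding similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names (cosine, retrieve_labels), the choice of cosine similarity, the similarity-weighted voting, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from a protein language model such as the structure-informed ESM-2 variants.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_labels(query, database, k=2):
    """Score function labels for a query embedding (hypothetical helper).

    database: list of (embedding, set_of_labels) for annotated proteins.
    Returns a dict mapping each label seen among the k nearest neighbors
    to a score in [0, 1], weighted by (non-negative) cosine similarity.
    """
    # Rank annotated proteins by similarity to the query and keep the top k.
    ranked = sorted(
        ((cosine(query, emb), labels) for emb, labels in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Clip negative similarities so they cannot act as negative votes.
    weights = [max(sim, 0.0) for sim, _ in ranked]
    total = sum(weights)
    scores = defaultdict(float)
    if total == 0.0:
        return dict(scores)

    # Each neighbor votes for its labels in proportion to its similarity.
    for weight, (_, labels) in zip(weights, ranked):
        for label in labels:
            scores[label] += weight / total
    return dict(scores)


# Toy annotated set: embeddings and labels are made up for illustration.
database = [
    ([1.0, 0.0], {"EC:1.1.1.1"}),
    ([0.9, 0.1], {"EC:1.1.1.1", "GO:0008270"}),
    ([0.0, 1.0], {"EC:3.4.21.4"}),
]
scores = retrieve_labels([1.0, 0.05], database, k=2)
```

In this toy case the two nearest neighbors both carry the first EC label, so it receives the full score, while a label carried by only one neighbor receives a partial score. The intuition behind this style of evaluation is that if structure-informed embeddings place functionally related proteins closer together, then nearest-neighbor label transfer should become more accurate without training any task-specific predictor.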
