Structure-Informed Protein Language Model

Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang

TL;DR

This work addresses the lack of explicit structural supervision in protein language models by injecting structural information through remote homology detection, so that no protein structures are needed as input. The authors fine-tune ESM-2 models on a SCOPe fold-level remote homology task to produce structure-informed models (denoted with the suffix "-S") and evaluate them with predictor-based and retriever-based methods across EC, GO, localization, and mutant datasets. The structure-informed models show robust gains on EC and GO annotation, with mixed outcomes on localization and mutant tasks depending on how closely the target property relates to protein structure. The findings highlight both the potential and the limitations of structural distillation for protein function prediction and encourage further development of structure-aware PLMs.

Abstract

Protein language models are a powerful tool for learning protein representations through pre-training on vast protein sequence datasets. However, traditional protein language models lack explicit structural supervision, despite its relevance to protein function. To address this issue, we integrate remote homology detection to distill structural information into protein language models without requiring explicit protein structures as input. We evaluate the impact of this structure-informed training on downstream protein function prediction tasks. Experimental results reveal consistent improvements in function annotation accuracy for EC number and GO term prediction. Performance on mutant datasets, however, varies with the relationship between the targeted properties and protein structures. This underscores the importance of considering this relationship when applying structure-aware training to protein function prediction tasks. Code and model weights are available at https://github.com/DeepGraphLearning/esm-s.
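
The retriever-based evaluation mentioned in the TL;DR can be pictured with a small sketch. It is not the authors' implementation: the cosine similarity, the top-k cutoff, the similarity-weighted voting, and the names `cosine` and `retrieve_labels` are all assumptions made for illustration, and the toy vectors stand in for embeddings that would in practice come from a model such as ESM-2-650M-S.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_labels(query, bank_embeddings, bank_labels, k=2):
    """Hypothetical retriever: score each label by similarity-weighted votes
    from the k annotated proteins closest to the query embedding."""
    sims = [(cosine(query, emb), labels)
            for emb, labels in zip(bank_embeddings, bank_labels)]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    # Assumption: negative similarities carry no vote.
    weights = [max(sim, 0.0) for sim, _ in top]
    total = sum(weights)
    scores = {}
    if total == 0:
        return scores
    for weight, (_, labels) in zip(weights, top):
        for label in labels:
            scores[label] = scores.get(label, 0.0) + weight / total
    return scores


# Toy annotated "bank" with made-up 2-D embeddings and example EC numbers.
bank_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bank_labels = [{"EC:1.1.1.1"}, {"EC:2.7.11.1"}, {"EC:1.1.1.1", "EC:3.1.1.1"}]

scores = retrieve_labels([1.0, 0.1], bank_embeddings, bank_labels, k=2)
print(sorted(scores.items()))
```

In this toy case the two nearest neighbors both carry EC:1.1.1.1, so that label receives the full vote, while the label of the distant protein is not retrieved at all. Under this reading, structure-informed training would help whenever it moves functionally related proteins closer together in embedding space.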

Paper Structure (6 sections, 2 equations, 4 figures)

Figures (4)

  • Figure 1: Illustration of Training Procedure and Embeddings for Structure-Informed Protein Language Models. (A) Protein language models like ESM-2-650M are enhanced with structural information through training on remote homology detection tasks. This process yields the structure-informed model, ESM-2-650M-S, whose embeddings capture more structural characteristics. (B) We present UMAP embeddings of both ESM-2-650M and ESM-2-650M-S on the SCOPe dataset. After targeted training, ESM-2-650M-S embeddings show improved separability for different protein folds.
  • Figure 2: Results on function prediction tasks with various sizes of ESM-2 models as feature extractors. Structure-informed models are denoted with suffixes "-S" and highlighted with dots.
  • Figure 3: Fmax on function annotation with various sizes of ESM-2 models as retrievers with suffixes "-R". Structure-informed retrievers are denoted with suffixes "-RS" and highlighted with dots.
  • Figure 4: Results of EC annotation on NEW-392 and Price-149 test sets. The two proposed retrievers are shown in warm colors, whereas the other baselines are shown in cool colors.
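
Figure 3 reports Fmax, which may be unfamiliar. The sketch below computes the protein-centric Fmax as it is commonly defined for function annotation benchmarks: for each decision threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and the best resulting F1 is kept. The function name `fmax`, the threshold grid, and the toy inputs are assumptions for illustration; the paper's own evaluation code may differ in details.

```python
def fmax(scores, labels, thresholds=None):
    """Protein-centric Fmax.

    scores: one list per protein of predicted scores in [0, 1], one per term.
    labels: one list per protein of 0/1 ground-truth flags, same shape.
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99.
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for score_row, label_row in zip(scores, labels):
            n_true = sum(label_row)
            if n_true == 0:
                continue  # proteins without annotations are skipped
            predicted = [s >= t for s in score_row]
            n_pred = sum(predicted)
            true_pos = sum(1 for p, y in zip(predicted, label_row) if p and y)
            recalls.append(true_pos / n_true)
            if n_pred > 0:
                # Precision only counts proteins with at least one prediction.
                precisions.append(true_pos / n_pred)
        if not precisions or not recalls:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: two proteins, two terms; some threshold separates them perfectly.
perfect = fmax([[0.9, 0.1], [0.8, 0.7]], [[1, 0], [1, 0]])
# One protein whose true and false terms get the same score.
tied = fmax([[0.6, 0.6]], [[1, 0]])
print(perfect, tied)
```

Because the maximum is taken over thresholds, Fmax rewards a model whose scores rank correct terms above incorrect ones, regardless of how the scores are calibrated.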