CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation

Wenrui Gou, Wenhui Ge, Yang Tan, Mingchen Li, Guisheng Fan, Huiqun Yu

TL;DR

CPE-Pro (Crystal vs Predicted Evaluator for Protein Structure) is a structure-sensitive supervised deep learning model for the representation and origin evaluation of protein structures. Preliminary results indicate that structural sequences enriched with local structural features enable the model to capture more informative protein characteristics, thereby enhancing and refining protein representations.

Abstract

Protein structures are important for understanding protein functions and interactions. Many protein structure prediction methods are now enriching structure databases. Discriminating the origin of structures is crucial for distinguishing between experimentally resolved and computationally predicted structures, evaluating the reliability of prediction methods, and guiding downstream biological studies. Building on work in structure prediction, we developed a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), to represent protein structures and discriminate their origin. CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes, and is expected to extend to more. In addition, we used Foldseek to encode protein structures into "structure-sequences" and trained a protein Structural Sequence Language Model, SSLM. Preliminary experiments demonstrated that, compared to large-scale protein language models pre-trained on vast amounts of amino acid sequences, the "structure-sequence" enables the language model to learn more informative protein features, enhancing and optimizing structural representations. The code, model weights, and all related materials are available at https://github.com/GouWenrui/CPE-Pro-main.git.

Paper Structure

This paper contains 13 sections, 17 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: a. Protein representation methods. Proteins can be input into the model in various forms, including amino acid sequences, feature maps, three-dimensional coordinates, functional descriptions, and sequences composed of structural tokens, capturing the multi-level features of proteins. b. Pre-training of SSLM. SSLM is pre-trained on over 100,000 protein structures from the Swiss-Prot database and trained on various masked language modeling tasks, learning the relationships between "structure-sequences" and their corresponding three-dimensional structural features, thereby effectively representing protein structural information. c. CPE-Pro model architecture. The CPE-Pro model integrates a pre-trained protein structure language model with a graph embedding module, inputting the combined representation into the GVP-GNN module for computation. The pooling module aggregates structural information using attention masking, enhancing the quality of the representation. Ultimately, a multilayer perceptron serves as the source discriminator, outputting predicted probabilities.
  • Figure 2: The perplexity $\downarrow$ of SSLM on the validation set. Among the 4 training strategies, the combination of a 25% masking rate with the 9:0:1 masking method shows superior performance. The original curve depicts how perplexity changes with training steps, while the smoothed curve illustrates its trend, reducing noise and providing a clearer view of the decreasing perplexity trend.
  • Figure 3: The pLDDT scores of the predicted protein structures in the dataset used for training CPE-Pro and the similarity between "structure-sequences".
  • Figure 4: t-SNE visualization of the feature embeddings of four pre-trained versions of SSLM on the SCOPe database, reduced to a two-dimensional plane.
  • Figure 5: t-SNE visualization of the feature embeddings of various PLMs on the SCOPe database, reduced to a two-dimensional plane.
  • ...and 1 more figure
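The Figure 2 caption refers to a 25% masking rate, a 9:0:1 masking method, and validation perplexity. The following is a minimal sketch of how such a setup might look, not the authors' implementation. It assumes that 9:0:1 means the ratio of selected positions that are replaced by a mask symbol, replaced by a random token, or left unchanged (by analogy with the 8:1:1 split common in masked language modeling). The vocabulary `VOCAB`, the mask symbol `MASK`, and the function names are hypothetical; perplexity is computed as the exponential of the mean negative log-likelihood of the true tokens.

```python
import math
import random

# Assumption: a 20-letter structural-token alphabet, written here with
# lowercase amino-acid-style letters purely for illustration.
VOCAB = list("acdefghiklmnpqrstvwy")
MASK = "#"  # hypothetical mask symbol


def mask_tokens(seq, mask_rate=0.25, split=(9, 0, 1), rng=None):
    """Select about mask_rate of the positions as prediction targets.

    Each selected position is replaced by MASK, replaced by a random token,
    or left unchanged, in the proportions given by split (assumed meaning).
    Returns the corrupted sequence and the sorted target positions.
    """
    rng = rng or random.Random(0)
    n_select = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), n_select)
    total = sum(split)
    out = list(seq)
    for pos in positions:
        r = rng.random() * total
        if r < split[0]:
            out[pos] = MASK
        elif r < split[0] + split[1]:
            out[pos] = rng.choice(VOCAB)
        # Otherwise the token is kept as-is but still counts as a target.
    return "".join(out), sorted(positions)


def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the true tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


if __name__ == "__main__":
    sequence = "dvvlvvcqvp" * 4  # toy 40-token "structure-sequence"
    corrupted, targets = mask_tokens(sequence)
    print(corrupted)
    print(targets)
    # A model that guesses uniformly over 20 tokens has perplexity close to 20.
    print(perplexity([1 / len(VOCAB)] * len(targets)))
```

Under this reading, a lower perplexity on the validation set means the model assigns higher probability to the true structural tokens at the target positions, which is why the caption marks perplexity with a downward arrow.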