Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers

Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep, Asif Ekbal

TL;DR

This work tackles protein secondary structure prediction from sequence by introducing SSRGNet, a framework that fuses DistilProtBert sequence embeddings with a Relational Graph Convolutional Network operating on multi-relational protein graphs. The method explicitly encodes 3D structural information through three edge types and uses a parallel fusion strategy to combine sequence and structure cues, achieving improved F1 scores on the NetSurfP-2.0 benchmarks across CB513, TS115, and CASP12. Key contributions include the design of a residue-graph representation, a two-layer R-GCN for relational message passing, and an evaluation showing structure-aware encoding enhances PSSP performance beyond sequence-only baselines. The approach has implications for more accurate secondary- and tertiary-structure predictions and could inform protein function analysis and drug design through better structural representations.

Abstract

In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. By applying convolution over each node's local neighborhood, taking edge relations into account, and stacking multiple convolutional layers, we efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which labels secondary structure in 3 and 8 states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1 scores.
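
The relational message passing mentioned in the abstract can be illustrated with a minimal sketch. It follows the generic R-GCN update (a self-connection term plus, for each relation, the mean of transformed neighbor features, followed by ReLU) and is not taken from the authors' code. The toy graph, the assignment of edges to the relation names `R1`, `R2`, and `R3`, the identity weight matrices, and the helper names `matvec` and `rgcn_layer` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape: (in_dim, out_dim)


def matvec(h: Vector, w: Matrix) -> Vector:
    """Compute h @ w for a row vector h of length in_dim."""
    return [sum(h[k] * w[k][j] for k in range(len(h))) for j in range(len(w[0]))]


def rgcn_layer(
    feats: List[Vector],
    edges: Dict[str, List[Tuple[int, int]]],
    w_rel: Dict[str, Matrix],
    w_self: Matrix,
) -> List[Vector]:
    """One relational message-passing step followed by ReLU.

    edges[r] holds directed (source, target) residue pairs for relation r.
    """
    out = [matvec(h, w_self) for h in feats]  # self-connection term
    for rel, pairs in edges.items():
        # Group incoming neighbors per target node for this relation.
        incoming: Dict[int, List[int]] = {}
        for src, dst in pairs:
            incoming.setdefault(dst, []).append(src)
        for dst, srcs in incoming.items():
            norm = 1.0 / len(srcs)  # mean over neighbors under this relation
            for src in srcs:
                msg = matvec(feats[src], w_rel[rel])
                out[dst] = [o + norm * m for o, m in zip(out[dst], msg)]
    return [[max(0.0, x) for x in row] for row in out]  # ReLU


# Hypothetical toy protein graph: 3 residues, 2 features each.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {
    "R1": [(0, 1), (1, 0), (1, 2), (2, 1)],  # sequential neighbors
    "R2": [(0, 2), (2, 0)],                  # spatial proximity (assumed)
    "R3": [(1, 2)],                          # local environment (assumed)
}
identity = [[1.0, 0.0], [0.0, 1.0]]
w_rel = {name: identity for name in edges}

h1 = rgcn_layer(feats, edges, w_rel, identity)  # first layer
h2 = rgcn_layer(h1, edges, w_rel, identity)     # second stacked layer
```

Stacking the layer twice lets each residue aggregate information from neighbors of neighbors, which is the usual motivation for a multi-layer design. In practice the weight matrices would be learned rather than fixed to the identity.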

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Graph Representation of Protein Sequence: The figure illustrates the hierarchical levels of protein structures, ranging from the primary amino acid sequence to secondary, tertiary, and quaternary structures. The primary sequence is then transformed into a protein graph, where amino acids are represented as nodes and edges capture different relationships: R1 (sequential relationship) with $E_{R1}=1$ shown in blue and with $E_{R1}=-1$ in yellow, R2 (spatial proximity) in red, and R3 (local environment) in green. This graph-based representation enables structured learning of protein properties using a Relational Graph Convolutional Network (R-GCN).
  • Figure 2: SSRGNet Model Overview: The diagram illustrates the SSRGNet model, which combines a DistilProtBert model and an R-GCN (Relational Graph Convolutional Network) model. The amino acid sequence is converted into a graph representation featuring three types of edges, as shown at the bottom, and passed into the respective encoders. The features obtained for the amino acid sequence and for the protein graph are then fused to predict the three-state and eight-state secondary structure of the protein.
  • Figure 3: Comparative Architectural Schemes of Fusion Techniques. The diagram illustrates three distinct methodologies: (a) Series Fusion, where components process information sequentially (element-wise addition); (b) Parallel Fusion, where outputs are concatenated along the last dimension, and (c) Cross Fusion, characterized by intertwined (multi-head attention) processing layers.
  • Figure 4: Confusion matrices for SSRGNet on different datasets for eight-state prediction.
  • Figure 5: Results for the Ablation Study on different evaluation metrics. Comparison of Loss, Accuracy and F1-scores for different fusion methods on various datasets.
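
The three fusion schemes named in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names and toy inputs are hypothetical, and the cross-fusion variant uses a single attention head without learned projections, whereas the caption describes multi-head attention.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape: (sequence_length, feature_dim)


def series_fusion(seq: Matrix, graph: Matrix) -> Matrix:
    """Element-wise addition; both inputs must share the same shape."""
    return [[a + b for a, b in zip(s, g)] for s, g in zip(seq, graph)]


def parallel_fusion(seq: Matrix, graph: Matrix) -> Matrix:
    """Concatenate per-residue features along the last dimension."""
    return [s + g for s, g in zip(seq, graph)]


def cross_fusion(seq: Matrix, graph: Matrix) -> Matrix:
    """Single-head scaled dot-product attention (a simplification).

    Sequence rows act as queries; graph rows act as keys and values.
    """
    scale = math.sqrt(len(seq[0]))
    fused = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in graph]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        fused.append([
            sum(w / total * v[d] for w, v in zip(weights, graph))
            for d in range(len(graph[0]))
        ])
    return fused


# Hypothetical per-residue features from the sequence and graph encoders.
seq_feats = [[1.0, 2.0], [3.0, 4.0]]
graph_feats = [[0.5, 0.5], [1.0, -1.0]]

added = series_fusion(seq_feats, graph_feats)           # shape (2, 2)
concatenated = parallel_fusion(seq_feats, graph_feats)  # shape (2, 4)
attended = cross_fusion(seq_feats, graph_feats)         # shape (2, 2)
```

Parallel fusion doubles the feature width, so any downstream classifier must accept the concatenated dimension. Series fusion keeps the width unchanged but requires both encoders to emit features of the same size.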