ProteinRPN: Towards Accurate Protein Function Prediction with Graph-Based Region Proposals

Shania Mitra, Lei Huang, Manolis Kellis

TL;DR

ProteinRPN tackles the challenge of translating protein structure into function by introducing a graph-based region proposal network that identifies and refines functional regions as anchors on residue graphs. It combines k-hop GCNs for anchor discovery, node-drop pooling with penumbral cone attention, a functional attention layer, and a Graph Multiset Transformer to produce graph-level GO predictions, trained with a composite loss including SupCon and InfoNCE. Pretraining on PDBSite followed by evaluation on HEAL demonstrates statistically significant improvements in GO term prediction and functional residue localization, outperforming state-of-the-art baselines and enabling interpretable identification of functional regions. The approach advances protein structure-function understanding with a robust, scalable framework that couples region-level detection to principled graph-based representation learning.

Abstract

Protein function prediction is a crucial task in bioinformatics, with significant implications for understanding biological processes and disease mechanisms. While the relationship between sequence and function has been extensively explored, translating protein structure to function continues to present substantial challenges. Various models, particularly CNN- and graph-based deep learning approaches that integrate structural and functional data, have been proposed to address these challenges. However, these methods often fall short in elucidating the functional significance of key residues essential for protein functionality, as they predominantly adopt a retrospective perspective, leading to suboptimal performance. Inspired by region proposal networks in computer vision, we introduce the Protein Region Proposal Network (ProteinRPN) for accurate protein function prediction. Specifically, the region proposal module of ProteinRPN identifies potential functional regions (anchors), which are refined through a hierarchy-aware node drop pooling layer that favors nodes with defined secondary structures and spatial proximity. The representations of the predicted functional nodes are enriched using attention mechanisms and subsequently fed into a Graph Multiset Transformer, which is trained with supervised contrastive (SupCon) and InfoNCE losses on perturbed protein structures. Our model demonstrates significant improvements in predicting Gene Ontology (GO) terms and effectively localizes functional residues within protein structures. The proposed framework provides a robust, scalable solution for protein function annotation, advancing the understanding of protein structure-function relationships in computational biology.
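
To make the contrastive objective mentioned above more concrete, the following is a minimal sketch of the standard InfoNCE loss, assuming each protein graph embedding is paired with the embedding of a perturbed copy of the same structure and that the other perturbed embeddings in the batch serve as negatives. The function names, the cosine similarity, the temperature value, and the toy vectors are illustrative assumptions; the paper's actual perturbation scheme and loss weighting are not specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive view of anchors[i]; every other
    positives[j] (j != i) is treated as a negative for anchors[i].
    The temperature is a hypothetical default, not a value from the paper.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

# Toy embeddings (assumed): two graphs and their perturbed views.
graph_emb = [[1.0, 0.0], [0.0, 1.0]]
perturbed_emb = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(graph_emb, perturbed_emb)          # views matched correctly
shuffled = info_nce(graph_emb, perturbed_emb[::-1])   # views deliberately mismatched
```

In this toy setup the loss is small when each embedding sits close to its own perturbed view and large when the pairing is shuffled, which is the behavior a contrastive term of this kind is meant to encourage.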

Paper Structure (22 sections, 8 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: The ProteinRPN model predicts protein function by converting protein sequences into residue graphs, processing them through a k-layer GCN to identify functional subgraphs (anchors), refining these subgraphs via domain knowledge and hierarchy-aware attention mechanisms, and categorizing them into GO terms using a GMT layer
  • Figure 2: Visual Demonstration of Region Proposal Network detected residues in proteins (a) 2BCC-B and (b) 2CHG-A
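
The anchor refinement step summarized in Figure 1 builds on node-drop pooling. As a rough illustration only, the sketch below shows generic top-k node-drop pooling on a small residue graph: nodes are ranked by a score, the top fraction is kept, and only edges between kept nodes survive. The scores, the `ratio` parameter, and the function name are assumptions for illustration; this is not the hierarchy-aware variant described in the paper, which also accounts for secondary structure and spatial proximity.

```python
import math

def topk_node_drop(scores, edges, ratio=0.5):
    """Keep the ceil(ratio * n) highest-scoring nodes and the edges among them.

    scores: one float per node (e.g. a learned functional-relevance score).
    edges:  list of (u, v) index pairs over the original node indices.
    Returns the kept original node indices (ascending) and the surviving
    edges re-indexed into the pooled graph.
    """
    n = len(scores)
    k = max(1, math.ceil(ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    remap = {old: new for new, old in enumerate(keep)}
    kept_edges = [(remap[u], remap[v]) for u, v in edges if u in remap and v in remap]
    return keep, kept_edges

# Toy residue graph (assumed): 5 residues with hypothetical scores.
scores = [0.9, 0.2, 0.8, 0.1, 0.7]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]

keep, kept_edges = topk_node_drop(scores, edges, ratio=0.5)
```

Here residues 0, 2, and 4 are retained, and the only surviving edge is the one between residues 0 and 2, re-indexed to the pooled graph.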