ProteinRPN: Towards Accurate Protein Function Prediction with Graph-Based Region Proposals
Shania Mitra, Lei Huang, Manolis Kellis
TL;DR
ProteinRPN tackles the challenge of translating protein structure into function by introducing a graph-based region proposal network that identifies and refines functional regions as anchors on residue graphs. It combines k-hop GCNs for anchor discovery, node-drop pooling with penumbral cone attention, a functional attention layer, and a Graph Multiset Transformer to produce graph-level GO predictions, trained with a composite loss including SupCon and InfoNCE. Pretraining on PDBSite followed by evaluation on HEAL demonstrates statistically significant improvements in GO term prediction and functional residue localization, outperforming state-of-the-art baselines and enabling interpretable identification of functional regions. The approach advances protein structure-function understanding with a robust, scalable framework that couples region-level detection to principled graph-based representation learning.
Abstract
Protein function prediction is a crucial task in bioinformatics, with significant implications for understanding biological processes and disease mechanisms. While the relationship between sequence and function has been extensively explored, translating protein structure to function continues to present substantial challenges. Various models, particularly CNN- and graph-based deep learning approaches that integrate structural and functional data, have been proposed to address these challenges. However, these methods often fall short in elucidating the functional significance of the key residues essential for protein functionality, as they predominantly adopt a retrospective perspective, leading to suboptimal performance. Inspired by region proposal networks in computer vision, we introduce the Protein Region Proposal Network (ProteinRPN) for accurate protein function prediction. Specifically, the region proposal module of ProteinRPN identifies potential functional regions (anchors), which are then refined by a hierarchy-aware node-drop pooling layer that favors nodes with defined secondary structures and spatial proximity. The representations of the predicted functional nodes are enriched using attention mechanisms and subsequently fed into a Graph Multiset Transformer, which is trained with supervised contrastive (SupCon) and InfoNCE losses on perturbed protein structures. Our model demonstrates significant improvements in predicting Gene Ontology (GO) terms and effectively localizes functional residues within protein structures. The proposed framework provides a robust, scalable solution for protein function annotation, advancing the understanding of protein structure-function relationships in computational biology.
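To make the contrastive objective mentioned above concrete, the following is a minimal sketch of an InfoNCE loss over paired embeddings, and not the authors' implementation. It assumes that each graph-level embedding has exactly one positive (its perturbed view) and that the other views in the batch serve as negatives. The function names (`info_nce`, `perturb`), the temperature of 0.1, and the use of Gaussian noise on embedding vectors as a stand-in for perturbed protein structures are all illustrative assumptions.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is treated as the positive for anchors[i]; every other
    positives[j] in the batch acts as a negative (an assumption of this sketch).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(a)


def perturb(v, rng, scale=0.01):
    """Hypothetical stand-in for a structure perturbation: small Gaussian noise."""
    return [x + rng.gauss(0.0, scale) for x in v]


rng = random.Random(0)
# Toy "graph-level embeddings" for 8 proteins, 16 dimensions each.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
views = [perturb(v, rng) for v in embeddings]

aligned_loss = info_nce(embeddings, views)
# Pairing each anchor with the wrong view should yield a higher loss.
shuffled_loss = info_nce(embeddings, views[1:] + views[:1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is low when each embedding is paired with its own perturbed view and higher when the pairing is shuffled, which is the behavior a contrastive objective of this kind is meant to encourage.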
