Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization
Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang
TL;DR
This work addresses the problem of predicting compound-protein affinity from limited target-specific data while providing interpretable explanations. It introduces a structure-aware graph neural network that uses activity-cliff matched molecular pairs to learn from both the shared scaffolds (common nodes) and the differing substituents (uncommon nodes). Group lasso and sparse group lasso regularizations are added to the loss to prune subgraphs and improve feature attribution. Empirical results on three Src kinase targets show improved predictive performance (lower RMSE, higher PCC) and enhanced explainability when regularization is applied, with substantial gains in global-direction attribution metrics and more accurate atom-level coloring. The approach supports drug discovery workflows by linking SAR-relevant substructures to affinity changes and by providing stable, interpretable subgraph-level insights to guide lead optimization.
Abstract
Explainable artificial intelligence (XAI) approaches are increasingly applied in drug discovery to learn molecular representations and to identify the substructures that drive property predictions. However, building end-to-end explainable models of structure-activity relationships (SAR) for compound property prediction faces several challenges, including the scarcity of compound-protein interaction activity data for specific protein targets and the fact that subtle changes at individual sites in a molecule can significantly affect its properties. We exploit activity cliffs: pairs of molecules that share a scaffold but differ at substituent sites and show large potency differences against a specific protein target. We propose a framework that uses graph neural networks (GNNs) to leverage both property and structure information from activity-cliff pairs to predict compound-protein affinity, measured as the half-maximal inhibitory concentration (IC50). To enhance model performance and explainability, we train the GNNs with structure-aware loss functions that use group lasso and sparse group lasso regularizations, which prune and highlight the molecular subgraphs relevant to activity differences. We applied this framework to activity-cliff data for molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Integrating common and uncommon node information with sparse group lasso improved property prediction, as reflected in a lower root mean squared error (RMSE) and a higher Pearson's correlation coefficient (PCC). Applying the regularizations also enhanced feature attribution for the GNNs by increasing graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures during lead optimization.
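To make the regularization terms concrete, the following is a minimal sketch of a group lasso and a sparse group lasso penalty in their standard textbook form. It is not the authors' implementation: the function names, the mixing parameter `alpha`, the square-root group-size weighting, and the toy split of atoms into a "common" scaffold group and an "uncommon" substituent group are all assumptions made for illustration.

```python
import numpy as np

def group_lasso(weights, groups):
    """Sum over groups of sqrt(group size) times the L2 norm of that group's weights."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(weights[g]) for g in groups)

def sparse_group_lasso(weights, groups, alpha=0.5):
    """Convex mix of an L1 term and the group lasso term, with alpha in [0, 1]."""
    l1 = np.abs(weights).sum()
    return alpha * l1 + (1.0 - alpha) * group_lasso(weights, groups)

# Hypothetical toy molecule (assumption): atoms 0-3 form the shared scaffold
# ("common" nodes) and atoms 4-5 form the differing substituent ("uncommon" nodes).
node_weights = np.array([0.0, 0.0, 0.0, 0.0, 3.0, 4.0])
groups = [np.array([0, 1, 2, 3]), np.array([4, 5])]

gl = group_lasso(node_weights, groups)
sgl = sparse_group_lasso(node_weights, groups, alpha=0.5)
print(gl, sgl)
```

In this sketch the scaffold group contributes nothing to the penalty because all of its weights are zero, which is the behavior that lets a group penalty prune an entire subgraph at once. The L1 term in the sparse variant additionally encourages sparsity among individual atoms inside a group that is kept. In training, such a penalty would be scaled by a regularization strength and added to the prediction loss.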
