Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

TL;DR

This work tackles the problem of predicting compound-protein affinity with limited target-specific data while delivering interpretable explanations. It introduces a structure-aware graph neural network that leverages activity-cliff matched molecular pairs to learn from both common scaffolds and uncommon decorations, incorporating group lasso and sparse group lasso regularizations in the loss to prune subgraphs and improve feature attribution. Empirical results on three Src kinase targets show improved predictive performance (lower RMSE, higher PCC) and enhanced explainability, with substantial gains in global-direction attribution metrics and more accurate atom-level coloring when regularization is applied. The approach advances drug discovery workflows by linking SAR-relevant substructures to affinity changes and providing stable, interpretable subgraph-level insights to guide lead optimization.

Abstract

Explainable artificial intelligence (XAI) approaches are increasingly applied in drug discovery to learn molecular representations and identify the substructures that drive property predictions. However, building end-to-end explainable models of structure-activity relationships (SAR) for compound property prediction faces several challenges: compound-protein interaction activity data are limited for any specific protein target, and many subtle changes at molecular configuration sites significantly affect molecular properties. We exploit activity-cliff pairs, that is, pairs of molecules that share a scaffold but differ at substituent sites and show large potency differences for a specific protein target. We propose a framework that uses graph neural networks (GNNs) to leverage property and structure information from activity-cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train the GNNs with structure-aware loss functions that use group lasso and sparse group lasso regularizations, which prune and highlight the molecular subgraphs relevant to activity differences. We applied this framework to activity-cliff data for molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Integrating common and uncommon node information with sparse group lasso improved property prediction, as reflected in a reduced root mean squared error (RMSE) and an improved Pearson's correlation coefficient (PCC). Applying the regularizations also enhances feature attribution for GNNs by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
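
To make the two regularizers named in the abstract concrete, the following is a minimal NumPy sketch of the textbook group lasso and sparse group lasso penalties. It is an illustration under stated assumptions, not the authors' loss: the function names, the `lam` and `alpha` parameters, the square-root group-size weighting, and the toy split into a "common" and an "uncommon" group are all hypothetical choices made for this example.

```python
import numpy as np

def group_lasso(weights, groups, lam=1.0):
    """Textbook group lasso: lam * sum_g sqrt(|g|) * ||w_g||_2.

    `groups` is a list of index arrays; each array selects one group
    (assumed here to stand for one molecular subgraph).
    """
    penalty = 0.0
    for idx in groups:
        w_g = weights[idx]
        penalty += np.sqrt(len(idx)) * np.linalg.norm(w_g)
    return lam * penalty

def sparse_group_lasso(weights, groups, lam=1.0, alpha=0.5):
    """Textbook sparse group lasso: a convex mix of group lasso and L1.

    alpha = 0 recovers group lasso (group-level sparsity only);
    alpha = 1 recovers the plain lasso (element-level sparsity only).
    """
    l1 = np.abs(weights).sum()
    return (1.0 - alpha) * group_lasso(weights, groups, lam) + alpha * lam * l1

# Toy example: five node weights split into two hypothetical groups.
weights = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [np.array([0, 1]),      # assumed "common" (scaffold) nodes
          np.array([2, 3, 4])]   # assumed "uncommon" (substituent) nodes

gl = group_lasso(weights, groups)
sgl = sparse_group_lasso(weights, groups, alpha=0.5)
print(f"group lasso penalty:        {gl:.4f}")
print(f"sparse group lasso penalty: {sgl:.4f}")
```

The group term penalizes whole groups through their unsquared L2 norm, which tends to push entire groups to zero, while the L1 term in the sparse variant additionally allows individual entries inside a surviving group to vanish.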

Paper Structure

This paper contains 13 sections, 5 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: An example pair of molecules targeting Src kinase 1O42 that forms an activity cliff ($\Delta pK_i = 3.85$). The two paired molecules have common and uncommon substructures. Common substructures consist of uncolored nodes (atoms) and edges (bonds), and uncommon substituents consist of colored nodes and edges (blue and red circles mark the substituent sites in the pair).
  • Figure 2: Model structure illustration. Considering a pair of compounds $c_i$ and $c_j$ that share a scaffold in the red circle and decorations in the blue circle, GNN and MPNN were applied to learn latent node representations for both common and uncommon nodes. This node-level information was aggregated to predict graph-level drug-protein binding affinity, and the mean squared error between the predicted and experimental affinities was then calculated. Additionally, masking functions for common and uncommon nodes and readout functions were applied to normalize the information, and multilayer perceptrons (MLPs) were combined with group lasso and sparse group lasso to create a subgraph node loss for both common and uncommon nodes. The two loss functions were minimized during training. "CN" denotes common nodes and structure, "UCN" denotes uncommon nodes and structure, and "N" denotes both common and uncommon nodes. Regularization methods were also applied to prune and select node-level information in activity cliff pairs. Group lasso considers only group-level sparsity (nodes within the same subgroup have the same color depth, representing the same weights after pruning), whereas sparse group lasso considers both group-level and within-group sparsity (nodes within the same subgroup have different color depths, representing various weights after pruning).
  • Figure 3: Model performance on the test-set molecules targeting each of the three kinases, under different loss function settings, via 5-fold cross-validation. The number on each bar is the averaged RMSE or PCC value for that loss function setting, and error bars represent the standard deviation. RMSE consistently decreases and PCC consistently increases as more node information is added (from only uncommon nodes to both common and uncommon nodes) and as regularization is added (from no penalty terms in the loss functions to group lasso and sparse group lasso). This suggests that the models using loss functions with regularization perform better on affinity prediction.
  • Figure 4: Comparison of averaged global direction scores for $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{N}}$ with and without group lasso, shown as scatter plots with connecting lines. The plots compare graph-level global direction scores by showing how the averaged predicted global direction values differ between the $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{N}}$ loss (x-axis) and the $\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{N}}$ loss with group lasso (y-axis) under different minimum common substructure (MCS) thresholds. Compound pairs are considered at the minimum 50% MCS threshold and from 50% to 100% in 5% increments. The text box reports the percentage increase in global direction scores with group lasso regularization for each feature attribution method: CAM (Figure \ref{fig:two_line_density}A) with a 47.13% increase and a significant Wilcoxon test p-value of 0.0002; Grad-CAM (Figure \ref{fig:two_line_density}B) with a 14.83% increase and a p-value of 0.0059; Gradient $\times$ Input (Figure \ref{fig:two_line_density}C) with a 15.4% increase and a p-value of 0.002; and IG (Figure \ref{fig:two_line_density}D) with an 8.49% increase and a p-value of 0.0098.
  • Figure 5: Comparison of atom-level accuracy in node coloring for ligands binding to the three kinases. Three colorings are shown: the ground-truth feature attribution labels, the Grad-CAM prediction with the MSE loss and node loss functions without penalty, and the Grad-CAM prediction with the same two loss terms plus sparse group lasso. When sparse group lasso was applied to the loss functions, the predicted atom coloring was much more consistent with the ground-truth feature attribution labels.
  • ...and 3 more figures
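
The two prediction metrics reported in the Figure 3 caption, RMSE and PCC, follow their standard definitions and can be sketched in a few lines of NumPy. The function names and the toy affinity values below are hypothetical and are not taken from the paper's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pcc(y_true, y_pred):
    """Pearson's correlation coefficient between the two value vectors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy affinities on a log scale (made-up numbers for illustration only).
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]

score_rmse = rmse(y_true, y_pred)
score_pcc = pcc(y_true, y_pred)
print(f"RMSE = {score_rmse:.3f}, PCC = {score_pcc:.3f}")
```

Lower RMSE means predictions are closer to the experimental values in absolute terms, while higher PCC means the predicted ranking and linear trend agree better with experiment; the two can move independently, which is presumably why both are reported.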