Optimal message passing for molecular prediction is simple, attentive and spatial

Alma C. Castaneda-Leautaud, Rommie E. Amaro

TL;DR

This work investigates minimalist, bidirectional message-passing neural networks for molecular property prediction, demonstrating that simpler architectures with edge-aware attention can achieve state-of-the-art results. By systematically removing self-node contributions from the message, introducing attention, and integrating 3D descriptors with 2D graphs, the authors show that dataset diversity modulates the need for additional components and that 2D representations augmented with carefully chosen 3D features can match fully 3D approaches while reducing computational cost by over 50%. Feature selection reveals buried volume and radius of gyration as consistently informative 3D-aware features, while traditional element-like features often hurt performance due to distributional biases. The ABMP model, which combines bidirectional message passing with edge-aware attention, delivers the strongest performance across multiple MoleculeNet benchmarks, with practical implications for fast, scalable drug discovery workflows. Overall, the study provides a principled, low-complexity pathway to high-performance molecular prediction, with actionable guidance on feature engineering and model design.

Abstract

The predictive performance of Message-Passing Neural Networks (MPNNs) for molecular property prediction can be improved by simplifying how the message is passed and by using descriptors that capture multiple aspects of molecular graphs. In this work, we designed model architectures that achieved state-of-the-art performance, surpassing more complex models such as those pre-trained on external databases. We assessed dataset diversity to complement our performance results, finding that structural diversity influences the need for additional components in our MPNNs and feature sets. In most datasets, our best architecture employs bidirectional message-passing with an attention mechanism, applied to a minimalist message formulation that excludes self-perception, highlighting that relatively simpler models, compared to classical MPNNs, yield higher class separability. In contrast, we found that convolution normalization factors do not benefit the predictive power in all the datasets tested. This was corroborated in both global and node-level outputs. Additionally, we analyzed the influence of both adding spatial features and working with 3D graphs, finding that 2D molecular graphs are sufficient when complemented with appropriately chosen 3D descriptors. This approach not only preserves predictive performance but also reduces computational cost by over 50%, making it particularly advantageous for high-throughput screening campaigns.
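
To make the message formulation described in the abstract more concrete, the following minimal sketch shows one attention-weighted, bidirectional aggregation step in which a node's update contains no copy of its own features. It is an illustration under stated assumptions, not the authors' implementation: the function name `abmp_step`, the linear scoring vector `w`, and the softmax over incoming edges are all hypothetical choices made for this example.

```python
import math

def abmp_step(x, edges, edge_attr, w):
    """One hypothetical attention-weighted, bidirectional message-passing step.

    x         : list of node feature vectors (lists of floats)
    edges     : list of undirected bonds (i, j)
    edge_attr : list of edge feature vectors, one per bond
    w         : scoring weights of length 2 * node_dim + edge_dim (assumed form)
    """
    # Bidirectional: every bond sends a message i -> j and j -> i.
    directed = []
    for (i, j), e in zip(edges, edge_attr):
        directed.append((i, j, e))
        directed.append((j, i, e))

    # Group incoming messages by destination node.
    incoming = {n: [] for n in range(len(x))}
    for src, dst, e in directed:
        # Edge-aware score: uses source, destination and edge features.
        feats = x[src] + x[dst] + e
        score = sum(wk * fk for wk, fk in zip(w, feats))
        incoming[dst].append((score, x[src]))

    out = []
    for n in range(len(x)):
        msgs = incoming[n]
        agg = [0.0] * len(x[n])
        if msgs:
            # Softmax over incoming edges, shifted by the max for stability.
            top = max(s for s, _ in msgs)
            exps = [math.exp(s - top) for s, _ in msgs]
            z = sum(exps)
            for a, (_, x_src) in zip(exps, msgs):
                for k, v in enumerate(x_src):
                    agg[k] += (a / z) * v
        # No self term: x[n] is not added to the aggregated message.
        out.append(agg)
    return out
```

With all-zero scoring weights the attention is uniform, so each node simply receives the mean of its neighbours' features; non-zero weights shift that average toward the higher-scoring neighbours.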

Paper Structure

This paper contains 34 sections, 23 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Diagram illustrating the general architecture of our MPNN for molecular classification. The message is processed batch-wise and includes the source ($x_i$) and destination ($x_j$) nodes and the edge attributes ($e_{ij}$), which are first processed by a Multi-Layer Perceptron (MLP), detailed in the top inset. The message-passing module depends on the tested model and optionally outputs node-level embeddings for colormap visualization. A global max pooling operation aggregates the node-level outputs into a single molecular representation, which is then concatenated with global features. This pooled representation is subsequently passed through another MLP to produce either a single scalar logit for classification tasks (C/1), which is processed by the Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss) operator, or a scalar value for regression (R/X), with the error calculated using the Mean Squared Error Loss (MSELoss). The model parameters are optimized using Adam.
  • Figure 2: Ranked features based on the cumulative sum of their positions across successive rounds of backward elimination. In each round, features were ordered from lower to higher F1 score upon removal, and points were assigned accordingly. Final ranks were determined by the total accumulated points, with higher scores indicating greater importance. The heatmap visually highlights the consistency of feature rankings across datasets.
  • Figure 3: Frequency histograms comparing single-valued per-element features with 3D-environment-aware features, using the TRPA1 set. A) Distribution of the normalized atomic number; the highest frequency count corresponds to carbon. Hydrogens were eliminated during data processing. B) The standardized buried volume node feature shows a Gaussian-like distribution.
  • Figure 4: Ablation study addressing the influence of the spatial arrangement of the molecules across the tested datasets (BACE, BBBP, TRPA1 and Lipophilicity). We included 3D conformations with added Gaussian noise (0.5 Å std. dev.), termed Noisy-3D, 2D conformations only, and spatial optimizations using the Merck Molecular Force-Field (MMFF) and the Universal Force-Field (UFF). The plots are separated into classification metrics (AUC, Accuracy, F1) and regression (RMSE). Error bars represent margins of error (95% confidence) over multiple runs.
  • Figure 5: Pareto plots for the BMP model showing the convergence of the dual-direction hyperparameter optimization protocol, run with the TPE sampler in the Optuna package. A) TRPA1 dataset panel visualizing the relationships between the four features optimized, hidden channel number (50-400), dropout rate (0.05-0.5), batch size (20-180), with the optimization ranges indicated in the colormap bars; the closer the color is to yellow, the higher the feature value. B) Pareto plot for the BACE dataset showing two colors: the blue dots indicate tested trials, while the red one corresponds to the trial we selected for our final models. C) Pareto plot for the BBBP dataset; the optimization did not converge to a minimum in both directions, and a trade-off between the two directions is observed instead.
  • ...and 7 more figures
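
The point-based ranking described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function name `rank_features` and the exact point scheme are assumptions. In particular, the caption says only that points were assigned according to the order from lower to higher F1 upon removal, so the sketch assumes that the feature whose removal gives the lowest F1 earns the most points in a round.

```python
def rank_features(rounds):
    """Aggregate feature positions across backward-elimination rounds.

    rounds: list of dicts, one per round, each mapping a feature name to the
            F1 score obtained when that feature is removed in that round.
    Returns a list of (feature, total_points), most important first.
    """
    points = {}
    for f1_after_removal in rounds:
        # Order features from lower to higher F1 score upon removal.
        ordered = sorted(f1_after_removal, key=f1_after_removal.get)
        n = len(ordered)
        for position, feature in enumerate(ordered):
            # Assumed scheme: the lowest F1 (largest drop) earns n points,
            # the highest F1 earns 1 point.
            points[feature] = points.get(feature, 0) + (n - position)
    # Higher accumulated points indicate greater importance.
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)
```

Later rounds may contain fewer features than earlier ones, since backward elimination drops features as it proceeds; a feature absent from a round earns no points in it.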