Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network

Tanya Liyaqat, Tanvir Ahmad, Mohammad Kashif, Chandni Saxena

TL;DR

This work presents STEM, a stacked ensemble for mutagenicity prediction that fuses substructural fingerprints, physicochemical descriptors, and graph-based spatial embeddings learned by a Graph Attention Network. The two-phase pipeline first encodes molecules into diverse representations and then combines the base-model predictions through a deep neural network, with SHAP providing explanations of feature importance. On the Hansen and Xu Ames mutagenicity datasets, STEM achieves state-of-the-art AUC values (notably 0.9521 on Hansen) and remains robust across multiple random seeds and cross-validation splits. The approach yields both high predictive performance and actionable interpretability, including alignment with known structural alerts and insight into which modalities most influence decisions. This multimodal, explainable framework advances early mutagenicity screening in drug discovery and offers a scalable blueprint for integrating heterogeneous molecular information.
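The two-phase idea can be illustrated with a minimal, standard-library-only sketch of stacking with out-of-fold predictions. It is not the authors' implementation: the `Stump` base learner, the synthetic two-feature data, and the logistic-regression meta-learner (used here in place of the deep neural network) are all hypothetical stand-ins chosen to keep the example short.

```python
import math
import random


class Stump:
    """Toy base learner (hypothetical stand-in for a real classifier):
    scores a sample by one feature relative to the class-mean midpoint."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        self.threshold = (mean_pos + mean_neg) / 2
        self.sign = 1.0 if mean_pos >= mean_neg else -1.0
        return self

    def score(self, X):
        return [self.sign * (x[self.feature] - self.threshold) for x in X]


def k_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def out_of_fold_scores(make_models, X, y, k=5, seed=0):
    """Phase 1: each sample is scored only by models that never saw it."""
    n_models = len(make_models())
    meta = [[0.0] * n_models for _ in X]
    for fold in k_folds(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, model in enumerate(make_models()):
            scores = model.fit(X_tr, y_tr).score([X[i] for i in fold])
            for i, s in zip(fold, scores):
                meta[i][j] = s
    return meta


def fit_logistic(M, y, lr=0.1, epochs=100):
    """Phase 2: logistic-regression meta-learner trained by SGD
    (a simple stand-in for the deep neural network in the paper)."""
    w, b = [0.0] * len(M[0]), 0.0
    for _ in range(epochs):
        for row, t in zip(M, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            grad = 1.0 / (1.0 + math.exp(-z)) - t
            w = [wi - lr * grad * xi for wi, xi in zip(w, row)]
            b -= lr * grad
    return w, b


# Synthetic data (assumption): feature 0 carries the label, feature 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[(1.0 if t else -1.0) + rng.uniform(-0.4, 0.4), rng.uniform(-1, 1)] for t in y]

meta = out_of_fold_scores(lambda: [Stump(0), Stump(1)], X, y, k=5)
weights, bias = fit_logistic(meta, y)
```

Training the meta-learner on out-of-fold scores, rather than on scores from models fitted to the full training set, keeps it from simply rewarding whichever base model overfits most. In this toy setup the meta-learner ends up with a positive weight on the informative stump's score.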

Abstract

Mutagenicity is a concern because of its association with genetic mutations, which can have a variety of negative consequences, including the development of cancer. Early identification of mutagenic compounds in the drug development process is therefore crucial for preventing the progression of unsafe candidates and reducing development costs. Although computational techniques, especially machine learning models, have become increasingly prevalent for this endpoint, they rely on a single modality. In this work, we introduce a novel stacked ensemble-based mutagenicity prediction model that incorporates multiple modalities, such as the simplified molecular input line entry system (SMILES) and the molecular graph. These modalities capture diverse information about molecules: substructural, physicochemical, geometrical, and topological. We derive substructural, geometrical, and physicochemical information from SMILES, while topological information is extracted from the molecular graph by a graph attention network (GAT). Our model uses a stacked ensemble of machine learning classifiers to make predictions from these features. We employ the explainable artificial intelligence (XAI) technique SHAP (SHapley Additive exPlanations) to determine the significance of each classifier and the most relevant features in the prediction. We demonstrate that our method surpasses state-of-the-art (SOTA) methods on two standard datasets across various metrics. Notably, we achieve an area under the curve of 95.21% on the Hansen benchmark dataset, affirming the efficacy of our method in predicting mutagenicity. We believe that this research will interest both clinicians and computational biologists engaged in translational research.
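As a rough illustration of how a graph attention layer weighs an atom's neighbours, the sketch below implements one single-head layer in the standard GAT formulation, without the final nonlinearity. The three-atom C-C-O graph, the one-hot atom features, and the values of `W` and `a` are assumptions made for the example; the abstract does not state the number of heads, layers, or input features the model actually uses.

```python
import math


def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(H, neighbors, W, a):
    """Single-head graph attention layer (final nonlinearity omitted).

    H          node feature vectors
    neighbors  neighbors[i] lists the nodes that node i attends to (incl. i)
    W          shared linear map, shape (out_dim x in_dim)
    a          attention vector of length 2 * out_dim
    Returns the updated node features and the attention weights.
    """
    Z = [matvec(W, h) for h in H]
    out, alphas = [], []
    for i, nbrs in enumerate(neighbors):
        # Unnormalised score e_ij = LeakyReLU(a . [z_i || z_j])
        e = [leaky_relu(sum(ak * v for ak, v in zip(a, Z[i] + Z[j]))) for j in nbrs]
        # Softmax over the neighbourhood, shifted by the max for stability
        m = max(e)
        ex = [math.exp(v - m) for v in e]
        total = sum(ex)
        alpha = [v / total for v in ex]
        alphas.append(alpha)
        # New feature of node i: attention-weighted sum of neighbour features
        out.append([sum(al * Z[j][d] for al, j in zip(alpha, nbrs))
                    for d in range(len(Z[i]))])
    return out, alphas


# Hypothetical toy graph: heavy atoms of ethanol, C-C-O, with self-loops.
atom_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot [is_C, is_O]
bonds = [[0, 1], [0, 1, 2], [1, 2]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With a zero attention vector every neighbour gets equal weight, so the
# layer reduces to a plain neighbourhood average.
uniform_out, uniform_alpha = gat_layer(atom_features, bonds, identity, [0.0] * 4)

# A non-zero attention vector lets a node favour particular neighbours;
# here the last entry of `a` rewards neighbours with a high "is_O" value.
biased_out, biased_alpha = gat_layer(atom_features, bonds, identity, [0.0, 0.0, 0.0, 2.0])
```

In a trained network `W` and `a` are learned, so the attention weights reflect which bonded atoms matter for the prediction rather than a fixed rule such as averaging.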

Paper Structure (24 sections, 11 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: The architecture of the proposed method STEM
  • Figure 3: The process of stacking with 5-fold CV
  • Figure 4: The 2D t-SNE representation of the gathered compounds, showing (A) the dataset features and (B) the associated labels. The dimensions t-SNE1 and t-SNE2 result from reducing the original feature space.
  • Figure 5: The flow diagram of Graph Attention Network
  • Figure 6: The change in metric results across 10 random seeds during the blind test on the (a) Hansen and (b) Xu datasets.
  • ...and 4 more figures