YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention
Chenxu Wang, Haowei Ming, Jian He, Yao Lu, Junhong Chen
TL;DR
This work addresses the prediction of organic drug solubility by introducing the YZS-Model, a multi-model framework that combines Graph Convolutional Networks (GCN), a self-attention Transformer, and Long Short-Term Memory (LSTM) networks to capture both spatial molecular topology and sequential information. Trained on a large Cui2020-derived dataset and evaluated on anticancer and Llinas test sets, the model achieves $R^2$ up to $0.59$ and RMSE as low as $0.57$ on challenging benchmarks, outperforming established baselines such as AttentiveFP and SolTranNet. The paper provides interpretability analyses using random feature zeroing and LIME to identify key atomic features (e.g., Aromaticity, Hybridization, Degree, Hydrogen), and uses ablation studies to demonstrate the critical role of the Transformer and LSTM modules. These results underscore the potential of integrating graph-based and sequence-aware deep learning components for solubility prediction, with implications for accelerating drug design and reducing development costs. Future work includes semi-supervised learning, multi-scale attention, and ensemble strategies to improve generalization to rare molecular structures.
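For readers less familiar with the two reported metrics, the following minimal sketch shows how $R^2$ and RMSE are conventionally computed from measured and predicted solubility values. The function names and the four toy values are illustrative assumptions, not data or code from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical log-solubility values, chosen only to illustrate the formulas.
y_true = [-2.0, -3.0, -4.0, -5.0]
y_pred = [-2.5, -3.0, -3.5, -5.0]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"R^2  = {r2_score(y_true, y_pred):.3f}")
```

A lower RMSE and a higher $R^2$ both indicate a closer fit, which is the sense in which the reported figures are compared against the baselines.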
Abstract
Accurate prediction of drug molecule solubility is crucial for therapeutic effectiveness and safety. Traditional methods often fail to capture complex molecular structures, which leads to inaccurate predictions. We introduce the YZS-Model, a deep learning framework that integrates Graph Convolutional Networks (GCN), Transformer architectures, and Long Short-Term Memory (LSTM) networks to improve prediction accuracy. GCNs capture intricate molecular topologies by modeling the relationships between atoms and bonds. Transformers use self-attention to identify long-range dependencies within molecules, capturing global interactions. LSTMs process sequential data, preserving long-term dependencies and integrating sequential information within molecular sequences. Combining the three components lets the model draw on the strengths of each when predicting molecular properties. Trained on 9,943 compounds and tested on an anticancer dataset, the YZS-Model achieved an $R^2$ of 0.59 and an RMSE of 0.57, outperforming benchmark models ($R^2$ of 0.52 and RMSE of 0.61). In an independent test, it achieved an RMSE of 1.05, improving accuracy by 45.9%. The integration of these deep learning techniques allows the YZS-Model to learn useful features from complex data without predefined parameters, handle large datasets efficiently, and adapt to various molecular types, which improves both predictive accuracy and generalizability. Accurate solubility predictions can expedite drug development by optimizing candidate selection, reducing costs, and improving efficiency. Our research underscores the potential of deep learning in pharmaceutical science, particularly for solubility prediction and drug design.
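To make the roles of the graph and attention components more concrete, the sketch below applies one graph-convolution step followed by single-head self-attention to a toy three-atom molecule. It is an illustration under stated assumptions, not the YZS-Model implementation: the propagation rule is the widely used symmetric-normalized form, the attention uses identity projections, the LSTM stage is omitted, and the adjacency matrix, feature sizes, and random weights are all hypothetical.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)                   # node degrees including self-loop
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight
    return np.maximum(propagated, 0.0)        # ReLU

def self_attention(x):
    """Single-head scaled dot-product self-attention (identity projections)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x, weights

rng = np.random.default_rng(0)

# Hypothetical 3-atom chain (atom 0 - atom 1 - atom 2), 4 input features per atom.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = rng.normal(size=(3, 4))
weight = rng.normal(size=(4, 8))              # untrained, illustrative weights

hidden = gcn_layer(adj, feats, weight)        # local, bond-level mixing
attended, attn = self_attention(hidden)       # every atom attends to every atom
graph_embedding = attended.mean(axis=0)       # simple mean-pool readout

print(hidden.shape, attn.shape, graph_embedding.shape)
```

The graph-convolution step only mixes information between bonded atoms, whereas the attention step lets atoms 0 and 2, which share no bond, influence each other directly. This is the local versus long-range distinction the abstract draws between the GCN and Transformer components.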
