YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention

Chenxu Wang; Haowei Ming; Jian He; Yao Lu; Junhong Chen

TL;DR

This work tackles high-accuracy prediction of organic drug solubility by introducing the YZS-Model, a multi-model framework that combines a Graph Convolutional Network (GCN), a self-attention Transformer, and a Long Short-Term Memory (LSTM) network to capture both spatial molecular topology and sequential information. Trained on a large Cui2020-derived dataset and evaluated on anticancer and Llinas test sets, the model achieves $R^2$ up to $0.59$ and RMSE as low as $0.57$ on challenging benchmarks, outperforming established baselines such as AttentiveFP and SolTransNet. The paper provides interpretability analyses using random feature zeroing and LIME to identify key atomic features (e.g., Aromaticity, Hybridization, Degree, Hydrogen) and demonstrates the critical role of the Transformer and LSTM modules through ablation studies. These results underscore the potential of integrating graph-based and sequence-aware deep learning components for accurate solubility prediction, with meaningful implications for accelerating drug design and reducing development costs. Future work includes semi-supervised learning, multi-scale attention, and ensemble strategies to further improve generalization to rare molecular structures.

Abstract

Accurate prediction of drug molecule solubility is crucial for therapeutic effectiveness and safety. Traditional methods often miss complex molecular structures, leading to inaccuracies. We introduce the YZS-Model, a deep learning framework integrating Graph Convolutional Networks (GCN), Transformer architectures, and Long Short-Term Memory (LSTM) networks to enhance prediction precision. GCNs excel at capturing intricate molecular topologies by modeling the relationships between atoms and bonds. Transformers, with their self-attention mechanisms, effectively identify long-range dependencies within molecules, capturing global interactions. LSTMs process sequential data, preserving long-term dependencies and integrating temporal information within molecular sequences. This multifaceted approach leverages the strengths of each component, resulting in a model that comprehensively understands and predicts molecular properties. Trained on 9,943 compounds and tested on an anticancer dataset, the YZS-Model achieved an $R^2$ of 0.59 and an RMSE of 0.57, outperforming benchmark models ($R^2$ of 0.52 and RMSE of 0.61). In an independent test, it demonstrated an RMSE of 1.05, improving accuracy by 45.9%. The integration of these deep learning techniques allows the YZS-Model to learn valuable features from complex data without predefined parameters, handle large datasets efficiently, and adapt to various molecular types. This comprehensive capability significantly improves predictive accuracy and model generalizability. Its precision in solubility predictions can expedite drug development by optimizing candidate selection, reducing costs, and enhancing efficiency. Our research underscores deep learning's transformative potential in pharmaceutical science, particularly for solubility prediction and drug design.
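The two figures of merit quoted above, $R^2$ and RMSE, can be illustrated with a short sketch. It assumes the conventional definitions (RMSE as the root of the mean squared error, $R^2$ as the coefficient of determination); the paper's Evaluation Metrics section may define $R^2$ differently, for example as a squared Pearson correlation. The function names and the sample log S values are hypothetical and do not come from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted log S values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical log S values, for illustration only (not data from the paper).
measured = [-2.0, -3.5, -1.0, -4.2]
predicted = [-2.3, -3.0, -1.4, -4.0]

print(f"RMSE = {rmse(measured, predicted):.3f}")
print(f"R^2  = {r_squared(measured, predicted):.3f}")
```

A lower RMSE means smaller typical errors in log S units, while $R^2$ closer to 1 means the predictions explain more of the variance in the measured values.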

Paper Structure (22 sections, 5 equations, 16 figures, 6 tables)

Introduction
Materials and Methods
Training and Testing
Data Preprocessing
10-Fold Data Split
Molecular Feature Extraction
Graph Neural Network
Graph Convolution Network
Self-Attention Transformer
Long Short-Term Memory
YZS-Model
Implementation Details
Evaluation Metrics
Results and Discussion
Performance of YZS-Model
...and 7 more sections

Figures (16)

Figure 1: Solubility (log S) Distribution Across Datasets.
Figure 2: Distribution of Molecular Features in the Training Dataset.
Figure 3: Overview of the Transformer model structure. The layer normalization (LN), multi-head self-attention (MSA), and feed-forward (FF) blocks mark the critical steps in sequence data processing.
Figure 4: Architecture of the YZS-Model. Drug molecules are first converted from SMILES notation into a graph representation, with each atom encoded as a node and each bond as an edge. A Graph Convolutional Network (GCN) aggregates features to capture structural information. A self-attention Transformer layer then processes these features to capture global dependencies, and an LSTM layer analyzes the sequence dynamics. Finally, graph pooling and a linear layer output the solubility prediction.
Figure 5: Error probability distribution of the YZS-Model on the test dataset.
...and 11 more figures
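
The first and last stages of the pipeline described in the Figure 4 caption (atoms as nodes, bonds as edges, GCN aggregation, graph pooling) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the symmetric normalization with self-loops is the commonly used GCN form and is assumed here, the per-atom features and weights are hypothetical, and the Transformer, LSTM, and final linear stages are omitted.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    The normalization is an assumption (the common GCN form), not confirmed
    by the source.
    """
    n = len(adjacency)
    in_dim, out_dim = len(weights), len(weights[0])
    # Add self-loops so each atom keeps its own features during aggregation.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Linear transform of node features: X W.
    xw = [[sum(features[i][k] * weights[k][o] for k in range(in_dim))
           for o in range(out_dim)] for i in range(n)]
    hidden = []
    for i in range(n):
        row = []
        for o in range(out_dim):
            # Degree-normalized sum over atom i and its bonded neighbors.
            agg = sum(a_hat[i][j] / math.sqrt(degree[i] * degree[j]) * xw[j][o]
                      for j in range(n))
            row.append(max(0.0, agg))  # ReLU
        hidden.append(row)
    return hidden


def mean_pool(node_states):
    """Graph pooling: average node vectors into one molecule-level vector."""
    n = len(node_states)
    return [sum(column) / n for column in zip(*node_states)]


# Ethanol (SMILES "CCO"), heavy atoms only: C0 - C1 - O2.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
# Hypothetical per-atom features: [is_carbon, is_oxygen, degree].
features = [[1.0, 0.0, 1.0],
            [1.0, 0.0, 2.0],
            [0.0, 1.0, 1.0]]
# Hypothetical weights mapping 3 input features to 2 hidden channels.
weights = [[0.5, -0.2],
           [0.1, 0.4],
           [0.3, 0.2]]

hidden = gcn_layer(adjacency, features, weights)
graph_vector = mean_pool(hidden)
print(graph_vector)
```

In the architecture the caption describes, the per-atom states would pass through the Transformer and LSTM layers before pooling, and a linear layer would map the pooled vector to a single log S value.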
