Predicting Side Effect of Drug Molecules using Recurrent Neural Networks
Collin Beaudoin, Koustubh Phalak, Swaroop Ghosh
TL;DR
The study addresses the prediction of drug side effects from molecular structure using a GRU-based recurrent neural network that learns sequence context from SELFIES representations derived from SMILES strings. It reports a $98\%$–$99\%$ reduction in parameter count relative to large graph-based or language models while delivering near-state-of-the-art accuracy on MoleculeNet benchmarks (e.g., SIDER, BBBP, ClinTox). The authors report strong ROC-AUC performance with this lightweight model and provide detailed cross-dataset comparisons against GROVER, ChemRL-GEM, and Galactica, highlighting practical advantages in compute and data requirements. The work suggests that smaller, more accessible sequence models can make molecular property prediction available to more practitioners, letting chemists screen candidates rapidly before experiments and potentially reducing drug development time and cost.
Abstract
Identification and verification of molecular properties such as side effects is one of the most important and time-consuming steps in the process of molecule synthesis. For example, failure to identify side effects before submission to regulatory groups can cost companies millions of dollars and months of additional research. Failure to identify side effects during regulatory review can also cost lives. The complexity and expense of this task have made it a candidate for a machine learning-based solution. Prior approaches to side effect prediction rely on complex model designs and excessive parameter counts. We believe that relying on complex models only shifts the difficulty away from chemists rather than alleviating it. Large models are also expensive to implement without prior access to high-performance computers. We propose a heuristic approach that allows the use of simple neural networks, specifically the recurrent neural network, with a 98+% reduction in the number of required parameters compared to available large language models, while still obtaining results nearly identical to those of top-performing models.
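To make the scale of such a parameter reduction concrete, the sketch below counts the trainable parameters of a stacked GRU and compares the total with a larger baseline. It is only an illustration: the layer sizes and the baseline count are hypothetical placeholders, not the configuration or the comparison figures reported in this work, and the count covers the recurrent layers only (no embedding or output layers).

```python
def gru_param_count(input_size, hidden_size, num_layers=1, double_bias=True):
    """Count trainable parameters in a stacked, unidirectional GRU.

    Each layer has three gate blocks (reset, update, candidate). Every block
    holds an input-to-hidden matrix, a hidden-to-hidden matrix, and biases.
    double_bias=True counts separate input and hidden bias vectors per gate,
    a convention some frameworks use; False counts a single bias vector.
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        biases = (2 if double_bias else 1) * hidden_size
        total += 3 * (layer_input * hidden_size
                      + hidden_size * hidden_size
                      + biases)
        # Layers after the first take the previous hidden state as input.
        layer_input = hidden_size
    return total


def reduction_percent(small, large):
    """Percentage reduction in parameter count when going from large to small."""
    return 100.0 * (1.0 - small / large)


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
small_model = gru_param_count(input_size=64, hidden_size=128, num_layers=2)
hypothetical_baseline = 100_000_000  # placeholder for a large model

print(f"GRU parameters: {small_model:,}")
print(f"Reduction vs. baseline: "
      f"{reduction_percent(small_model, hypothetical_baseline):.2f}%")
```

Because the count grows roughly with the square of the hidden size, even a generously sized GRU stays orders of magnitude below models with hundreds of millions of parameters.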
