When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings

Azmine Toushik Wasi; Šerbetar Karlo; Raima Islam; Taki Hasan Rafi; Dong-Kyu Chae

TL;DR

By reframing SMILES as text and feeding a bag-of-$n$-gram representation to an MLP classifier, the study demonstrates that simple NLP approaches can compete with specialized cheminformatics methods for drug-class prediction. The work compares 1- to 5-gram configurations and analyzes TopK token selection, finding the best trade-off at around $TopK=1250$. Fingerprint-based models (Morgan, MACCS, AtomPair) provide a strong baseline, and some of them achieve higher accuracy and ROC-AUC than the NLP approach. The findings highlight the practicality and accessibility of text-based classification for SMILES and offer a lightweight, interpretable alternative to more complex graph- or fingerprint-centric models.
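The bag-of-$n$-gram representation with TopK selection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes character-level tokens, ranks $n$-grams by raw corpus frequency, and uses a three-molecule toy corpus. The function names (char_ngrams, build_vocab, vectorize) are hypothetical. The resulting count vectors are the kind of input an MLP classifier would consume.

```python
from collections import Counter

def char_ngrams(smiles, n_min=1, n_max=5):
    """Yield every character n-gram of length n_min..n_max from one SMILES string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(smiles) - n + 1):
            yield smiles[i:i + n]

def build_vocab(corpus, top_k, n_min=1, n_max=5):
    """Map the top_k most frequent n-grams to column indices.

    Assumption: ranking is by raw frequency, ties broken alphabetically.
    """
    counts = Counter()
    for smiles in corpus:
        counts.update(char_ngrams(smiles, n_min, n_max))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: idx for idx, (gram, _) in enumerate(ranked[:top_k])}

def vectorize(smiles, vocab, n_min=1, n_max=5):
    """Count how often each vocabulary n-gram occurs in one SMILES string."""
    vec = [0] * len(vocab)
    for gram in char_ngrams(smiles, n_min, n_max):
        idx = vocab.get(gram)
        if idx is not None:
            vec[idx] += 1
    return vec

# Toy corpus (ethanol, acetic acid, benzene), used only for illustration.
toy_corpus = ["CCO", "CC(=O)O", "c1ccccc1"]
vocab = build_vocab(toy_corpus, top_k=10)  # the study reports TopK around 1250
features = [vectorize(s, vocab) for s in toy_corpus]
```

Each row of features has one column per retained $n$-gram, so TopK directly sets the input width of the downstream classifier.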

Abstract

Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of atoms and bonds. These SMILES strings are used in a wide range of complex machine learning-based drug research and representation learning work. Stepping away from complex representations, in this work we pose a single question: What if we treat drug SMILES as conventional sentences and apply text classification to drug classification? Our experiments affirm the possibility with very competitive scores. The study explores the notion of viewing each atom and bond as a sentence component and employs basic NLP methods to categorize drug types, showing that complex problems can also be solved from simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP.
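To make the "atoms and bonds as sentence components" view concrete, the sketch below splits a SMILES string into word-like tokens. It is an assumption-laden illustration rather than the paper's tokenizer: the regular expression is a simplified pattern covering bracket atoms, the two-letter halogens Cl and Br, ring-closure digits, and common bond and branch symbols, and the name tokenize_smiles is hypothetical.

```python
import re

# Simplified, illustrative SMILES token pattern (an assumption, not the paper's).
# Alternatives are tried left to right: bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, ring digits, bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/()@.:~*$]"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, ring, and branch tokens."""
    return SMILES_TOKEN.findall(smiles)

# Abemaciclib, the example molecule shown in the paper's Figure 2.
abemaciclib = (
    "CCN1CCN(CC2=CC=C(NC3=NC=C(F)C(=N3)C3=CC(F)=C4N=C(C)N(C(C)C)C4=C3)N=C2)CC1"
)
tokens = tokenize_smiles(abemaciclib)

# Joining tokens with spaces gives a "sentence" that standard
# text-classification tooling can consume.
sentence = " ".join(tokens)
```

Joining the tokens back without separators reproduces the original string, which is a quick check that the pattern did not drop any characters.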

Paper Structure (13 sections, 2 figures, 5 tables)

Introduction
Method
Experiment
Related Works
Experiments
Dataset
Experimental Details
Ablation Study: TopK
Discussion
Discussion on Experimental Findings
Discussion on Practical Impact and Scalability
Discussion on Limitations and Future Works
Conclusion

Figures (2)

Figure 1: Overview of our approach for Drug Classification using Text Classification Methods on Drug SMILES Strings
Figure 2: Molecular structure of the drug Abemaciclib, which has the following SMILES string: CCN1CCN(CC2=CC=C(NC3=NC=C(F)C(=N3)C3=CC(F)=C4N=C(C)N(C(C)C)C4=C3)N=C2)CC1 (source: DrugBank).
