A Hitchhiker's Guide to Deep Chemical Language Processing for Bioactivity Prediction
Rıza Özçelik, Francesca Grisoni
TL;DR
This work systematically evaluates deep chemical language processing (CLP) for bioactivity prediction by comparing CNN, RNN, and Transformer architectures across two molecular string representations (SMILES and SELFIES) and three encoding schemes on ten diverse datasets. It finds that CNNs consistently excel in classification tasks, while Transformers can outperform others on several regression targets, with RNNs rarely leading. The authors provide practical guidelines, recommending SMILES with learnable embeddings, loss re-weighting to handle class imbalance, and broad hyperparameter exploration, while noting opportunities in data augmentation and transfer learning. Overall, the study demonstrates that well-chosen CLP configurations can achieve strong predictive performance and offers actionable, data-driven recommendations to accelerate adoption in drug discovery contexts.
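As an illustration of the loss re-weighting recommended above, the following is a minimal sketch of a class-weighted binary cross-entropy. The exact weighting scheme is not specified in this summary; the inverse-frequency weights used here are an assumption (one common choice), and the function names `class_weights` and `weighted_bce` are hypothetical. The sketch also assumes that both classes occur at least once in the labels.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: binary labels (0/1) with both classes present.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

def weighted_bce(probs, labels, weights, eps=1e-7):
    """Mean binary cross-entropy, each sample scaled by its class weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)

# Toy imbalanced set: 1 active molecule, 9 inactive ones.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weights = class_weights(labels)

# An equally confident mistake costs more on the rare (active) class.
missed_active = weighted_bce([0.1], [1], weights)
missed_inactive = weighted_bce([0.9], [0], weights)
```

With these weights, errors on the minority class contribute more to the loss, which counteracts the tendency of a model to predict the majority class by default.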
Abstract
Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP learns from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite its growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP training to provide guidelines for newcomers and experts alike. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological choices, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
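To make the analogy with natural language processing concrete, the sketch below shows one way a SMILES string could be split into tokens and mapped to integer indices, which is the form a learnable embedding layer typically consumes. The token pattern is a simplified assumption (bracket atoms, the two-letter halogens Cl and Br, two-digit ring closures, then single characters) and is not necessarily the tokenization used in this study; the names `tokenize`, `build_vocab`, and `encode` are hypothetical.

```python
import re

# Simplified, assumed token pattern: bracket atoms, Br/Cl, %NN ring closures,
# then any remaining single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocab(token_lists):
    """Assign an integer index to every token; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Map tokens to indices, truncate to max_len, and right-pad with <pad>."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "ClCCBr", "C[NH3+]"]
tokenized = [tokenize(s) for s in smiles]
vocab = build_vocab(tokenized)
encoded = [encode(tokens, vocab, max_len=24) for tokens in tokenized]
```

Treating `Cl`, `Br`, and bracket atoms such as `[NH3+]` as single tokens keeps each atom intact, so the model does not have to learn that `C` followed by `l` denotes chlorine rather than carbon.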
