A Hitchhiker's Guide to Deep Chemical Language Processing for Bioactivity Prediction

Rıza Özçelik; Francesca Grisoni

TL;DR

This work systematically evaluates deep chemical language processing (CLP) for bioactivity prediction by comparing CNN, RNN, and Transformer architectures across two molecular string representations (SMILES and SELFIES) and three encoding schemes on ten diverse datasets. It finds that CNNs consistently excel in classification tasks, while Transformers can outperform others on several regression targets, with RNNs rarely leading. The authors provide practical guidelines, recommending SMILES with learnable embeddings, loss re-weighting to handle class imbalance, and broad hyperparameter exploration, while noting opportunities in data augmentation and transfer learning. Overall, the study demonstrates that well-chosen CLP configurations can achieve strong predictive performance and offers actionable, data-driven recommendations to accelerate adoption in drug discovery contexts.

Abstract

Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP learns from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite the growing importance of predictive CLP models, training them is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP training to provide guidelines for newcomers and experts alike. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological choices, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.

Paper Structure (15 sections, 1 equation, 5 figures, 2 tables)

Introduction
Methods
Molecular String Representations
Token Encoding
Deep Learning Architectures
Bioactivity Datasets
Performance Evaluation
Experimental Setup
Data Preparation
Model Training and Optimization
Results
Choosing a Neural Network Architecture
Representing and Encoding Molecular Structures
Other Tricks of the Trade
So Long, and Thanks for All the Data

Figures (5)

Figure 1: Deep Chemical Language Processing for Bioactivity Prediction. (a) String notations such as SMILES and SELFIES represent a molecular graph as a sequence of characters ('tokens'). The atoms are represented with periodic table symbols, while branches, rings, and bonds are assigned special characters. (b) Token encoding, where the chosen molecular string is converted into a matrix to train deep learning models. One-hot encoding represents each token with a unique binary vector. Random encoding maps tokens to fixed, unique, and continuous vectors. Learnable encoding starts with a random vector per token and updates the vectors during training to improve the model performance. (c) Architectures used in this study. Convolutional neural networks slide windows over the input sequences, and learn to weight and aggregate the input elements. Recurrent neural networks iterate over the input tokens in a step-wise manner, and update the 'hidden' information learned from the sequence ($h_i$). Transformers learn all-pair relationships between the input tokens and learn to weight each input representation to create the representations in the next layers ($a_i$).
Figure 2: Overview of dataset similarity and of model performance. (a,b) Distribution of test set similarities in comparison with training set molecules. The similarity was quantified as the Tanimoto coefficient on extended connectivity fingerprints (Rogers and Hahn, 2010), and the maximum similarity was reported. Different distributions can be observed in the classification (a) and regression (b) datasets, with the former containing more dissimilar molecules on average. (c,d) Performance of neural network architectures across datasets. Bar plots indicate the mean test set performance (with error bars denoting the standard deviation), in comparison with the XGBoost baseline (dashed line: average performance, shaded area: standard deviation). Performance was quantified as balanced accuracy in classification (c), and as concordance index in regression (d).
Figure 3: Effect of input molecular strings and of token encoding strategies. (a,b) Effect of SMILES and SELFIES representations on model performance. Classification (a) and regression (b) datasets are analyzed separately. (c,d) Effect of token encoding strategies on classification (c) and regression (d) performance. For all plots, bars indicate the mean test set performance of each notation or encoding strategy, and error bars indicate the standard deviation. The performance of the XGBoost baseline is also indicated (dashed line: average; shaded area: standard deviation).
Figure 4: Effect of loss re-weighting. Comparison of the classification performance obtained with and without loss re-weighting (i.e., assigning different weights to the molecules, as the inverse of their class frequency).
Figure 5: Hyperparameter tuning. (a-d) Most frequently occurring hyperparameter values among the top-ten models per dataset (CNN architecture, with SMILES strings and learnable embeddings). The following parameters were investigated: number of convolution layers (a), kernel length (b), number of filters (c), and token embedding dimension (d). (e,f) Model performance vs. explored hyperparameter space size. Performance of progressively subsampled models from 1 to 432 hyperparameter configurations (total) for both classification (e) and regression (f). The dashed line indicates 50% of models being explored.
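
The loss re-weighting described in the Figure 4 caption (weighting each molecule by the inverse of its class frequency) can be illustrated with a minimal sketch. The function names `inverse_frequency_weights` and `weighted_bce` are hypothetical, and the choice of binary cross-entropy normalized by the sum of the weights is an assumption made for illustration, not necessarily the loss used in the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample: 1 / (relative frequency of its class)."""
    counts = Counter(labels)
    n = len(labels)
    # Relative frequency of class y is counts[y] / n, so its inverse is n / counts[y].
    return [n / counts[y] for y in labels]


def weighted_bce(labels, probs, weights):
    """Weighted binary cross-entropy, normalized by the total weight (an assumption)."""
    eps = 1e-12  # keeps log() finite if a predicted probability is exactly 0 or 1
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# Toy imbalanced dataset: three inactive molecules (0) and one active (1).
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
loss = weighted_bce(labels, [0.5, 0.5, 0.5, 0.5], weights)
```

With these weights, each class contributes the same total weight to the loss, so the minority class is not drowned out by the majority class during training.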
