Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints

Markus Dablander; Thierry Hanser; Renaud Lambiotte; Garrett M. Morris

TL;DR

This work describes and recommends Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.

Abstract

Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures, which are subsequently used to generate a binary fingerprint of the desired length $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
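The contrast between the two pooling procedures described in the abstract can be illustrated with a minimal Python sketch. It assumes each compound has already been reduced to a set of integer substructure identifiers (as an ECFP implementation would produce). The function names `fold_by_hashing` and `fit_sort_and_slice` are hypothetical, and three details are assumptions not specified in the abstract: the modulo operation standing in for the folding hash, the tie-breaking by identifier, and the zero-padding when fewer than $L$ substructures occur in the training set.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set


def fold_by_hashing(substructure_ids: Set[int], length: int) -> List[int]:
    """Hash-based folding (assumption: modulo stands in for the hash).

    Distinct identifiers can land on the same position (a bit collision).
    """
    fingerprint = [0] * length
    for identifier in substructure_ids:
        fingerprint[identifier % length] = 1
    return fingerprint


def fit_sort_and_slice(
    training_sets: Sequence[Set[int]], length: int
) -> Callable[[Set[int]], List[int]]:
    """Sort substructures by training-set prevalence, keep the top `length`."""
    # Count in how many training compounds each substructure occurs.
    prevalence: Counter = Counter()
    for ids in training_sets:
        prevalence.update(ids)

    # Sort by decreasing prevalence; ties are broken by identifier
    # (an assumption made here only to keep the sketch deterministic).
    ranked = sorted(prevalence, key=lambda i: (-prevalence[i], i))

    # Slice: each retained substructure gets its own position, so no collisions.
    position = {identifier: k for k, identifier in enumerate(ranked[:length])}

    def featurise(substructure_ids: Set[int]) -> List[int]:
        # Positions stay zero if fewer than `length` substructures were seen.
        fingerprint = [0] * length
        for identifier in substructure_ids:
            k = position.get(identifier)
            if k is not None:  # substructures sliced away or unseen are ignored
                fingerprint[k] = 1
        return fingerprint

    return featurise


# Toy training set of three compounds with made-up identifiers.
training = [{11, 27, 35}, {11, 27}, {11, 43}]
featurise = fit_sort_and_slice(training, length=2)

print(featurise({27, 43}))            # [0, 1]: 27 is kept, 43 is sliced away
print(fold_by_hashing({11, 27}, 4))   # [0, 0, 0, 1]: 11 and 27 collide
```

In this sketch the fitted function is derived from the training compounds only and can then be applied unchanged to test compounds, which matches the abstract's description of ranking substructures by their prevalence in a given training set.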

Paper Structure (21 sections, 55 equations, 7 figures, 1 table)

Introduction
Methodology
Results and Discussion
Conclusions
Appendix: Further Computational Results

Figures (7)

Figure 1: Schematic overview of the vectorisation of ECFPs via the four investigated substructure-pooling techniques, which in general lead to four different final representations. As before, $\mathfrak{T} = \{\mathcal{M}_1, \dots, \mathcal{M}_n\}$ represents a given set of $n$ training compounds and $f_{\text{labels}} : \mathfrak{T} \to \mathbb{R}$ an associated labelling function that assigns regression or classification labels to the training set.
Figure 2: Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the lipophilicity regression data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across $2$-fold cross validation repeated with $3$ random seeds. The error bar length equals two standard deviations of the performance measured over the $2 \times 3 = 6$ trained models.
Figure 3: Boxplots visualising the predictive performance of the four investigated substructure-pooling methods (indicated by colours) for $20$ distinct modelling scenarios differing by data sets, data splitting techniques, and machine-learning models. Each boxplot summarises $24$ distinct performance results (each computed as an average over $2$-fold cross validation repeated with $3$ random seeds) generated by combining a respective substructure-pooling method with $24$ distinct ECFP-hyperparameter settings. The exhaustively explored ECFP-hyperparameter grid is given by three maximal substructure diameters $D \in \{2, 4, 6\}$, two lists of atomic invariants $A \in \{\text{standard ECFP, pharmacophoric FCFP}\}$, and four fingerprint dimensions $L \in \{512, 1024, 2048, 4096\}$.
Figure 4: Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the aqueous solubility regression data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across $2$-fold cross validation repeated with $3$ random seeds. The error bar length equals two standard deviations of the performance measured over the $2 \times 3 = 6$ trained models.
Figure 5: Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the SARS-CoV-2 main protease binding affinity regression data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across $2$-fold cross validation repeated with $3$ random seeds. The error bar length equals two standard deviations of the performance measured over the $2 \times 3 = 6$ trained models.
...and 2 more figures
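
The counts quoted in the Figure 3 caption follow directly from the stated hyperparameter grid. The short Python sketch below enumerates that grid to make the arithmetic explicit; the variable names and the string labels "ECFP" and "FCFP" are illustrative choices, not identifiers from the paper.

```python
from itertools import product

# ECFP-hyperparameter grid as given in the Figure 3 caption.
diameters = (2, 4, 6)                 # maximal substructure diameter D
invariants = ("ECFP", "FCFP")         # atomic invariants A (labels are illustrative)
lengths = (512, 1024, 2048, 4096)     # fingerprint dimension L

grid = list(product(diameters, invariants, lengths))
print(len(grid))                      # 24 settings: 3 * 2 * 4

# Each result averages 2-fold cross validation repeated with 3 random seeds.
models_per_result = 2 * 3

# Each boxplot summarises one result per grid setting.
models_per_boxplot = len(grid) * models_per_result
print(models_per_boxplot)             # 144 trained models behind one boxplot
```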

Theorems & Definitions (2)

Definition 1: Substructure Pooling
Definition 2: One-Hot Encoding
