Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search

Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn Gomez, Alexander Tropsha

TL;DR

This paper tackles the bottleneck of scaling exact chemical similarity search to billion-scale databases by pairing low-dimensional, structure-aware embeddings with a memory-efficient k-d tree. The authors introduce SmallSA, a learned embedding trained via the SALSA framework at dimensions of $8$ and $16$, and combine it with a custom k-d tree to perform sublinear, exact nearest-neighbor queries on a dataset of $1.3$ billion compounds. Across GED-based quality measurements, the RDKit virtual screening benchmark, and speed tests, SmallSA-based methods perform competitively with or better than traditional high-dimensional fingerprints while delivering speedups of up to $10^5$-fold on modest hardware. The work underscores the practicality of fast, exact chemical similarity searching at billion scale and points to further uses of low-dimensional embeddings in other cheminformatics tasks and in range queries.

Abstract

Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding -- SmallSA -- for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
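
The idea in the abstract, exact nearest-neighbor lookup over low-dimensional vectors with a k-d tree, can be illustrated with a minimal sketch. This is not the authors' memory-efficient implementation: the function names (`build_kdtree`, `nearest`, `brute_force`) are hypothetical, random 8-dimensional vectors stand in for SmallSA embeddings, and the database is tiny. It only shows that pruning subtrees by the distance to the splitting plane returns the same neighbor distance as a brute-force scan.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Build a k-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point splits the space
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Exact nearest neighbor; returns (squared_distance, point)."""
    if node is None:
        return best
    point, axis, left, right = node
    d = sq_dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the
    # best hit so far; otherwise no point there can improve the result.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best

def brute_force(points, query):
    """Reference answer: scan every point in the database."""
    return min((sq_dist(p, query), p) for p in points)

# Assumed toy data: random 8-dimensional vectors in place of real embeddings.
random.seed(0)
DIM = 8
database = [tuple(random.random() for _ in range(DIM)) for _ in range(2000)]
tree = build_kdtree(database)
query = tuple(random.random() for _ in range(DIM))

kd_hit = nearest(tree, query)
bf_hit = brute_force(database, query)
print("same nearest distance:", kd_hit[0] == bf_hit[0])
```

The pruning step is what makes low dimensionality matter: in 8 or 16 dimensions the splitting-plane test discards most subtrees, whereas with typical high-dimensional fingerprints it rarely does and the search degrades toward a full scan.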

Paper Structure (24 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Overview of the similarity search framework. k-d trees are combined with low-dimensional chemical embeddings to produce a partitioned chemical space, which can be quickly queried for nearest neighbors.
  • Figure 2: The average approximate graph edit distance (GED) between the query molecule and its nearest neighbors, per method, shown as a function of the number of neighbors considered. Lower distances are better. Lines are smoothed with a running average for ease of analysis.
  • Figure 3: AUROC achieved by each embedding on the RDKit virtual screening benchmark of 69 query targets, grouped by target database. Each database is indicated on the x-axis. Note that out of the 69 targets, most targets (50) belong to the ChEMBL database.
  • Figure 4: Example of a query molecule and the hits obtained by select high-performing embeddings.