PharmacoMatch: Efficient 3D Pharmacophore Screening via Neural Subgraph Matching

Daniel Rose, Oliver Wieder, Thomas Seidel, Thierry Langer

TL;DR

PharmacoMatch tackles the scalability bottleneck of 3D pharmacophore screening in huge chemical spaces by reframing pharmacophore matching as approximate neural subgraph matching learned through self-supervised contrastive learning. A graph neural network encoder maps pharmacophore graphs to an order-embedding space, trained with on-the-fly augmented positive/negative pairs via a max-margin loss to capture query–target relationships. The approach yields substantial runtime speedups (embedding once, fast vector comparisons) while maintaining competitive screening performance against traditional alignment methods on benchmark datasets, enabling practical pre-screening for billion-scale libraries. This work demonstrates the feasibility of vector-database–backed virtual screening and highlights avenues for further improvements in geometry precision and stereochemical discrimination.
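
The order-embedding idea in the summary above can be illustrated with a small sketch. It assumes the usual formulation from neural subgraph matching, in which a query should lie component-wise below a matching target, and the penalty is the squared violation of that ordering. The function names and the margin value are hypothetical, and the loss actually used by PharmacoMatch may differ in detail.

```python
from typing import Sequence


def order_penalty(query: Sequence[float], target: Sequence[float]) -> float:
    """Squared violation of the constraint query <= target (component-wise).

    Zero means the query embedding lies entirely to the "lower left" of the
    target embedding, i.e. the pair is predicted to match.
    """
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query, target))


def max_margin_loss(
    query: Sequence[float],
    target: Sequence[float],
    is_positive: bool,
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """Max-margin loss for one query-target pair.

    Positive pairs are pushed towards zero penalty; negative pairs are
    pushed until their penalty is at least `margin`.
    """
    penalty = order_penalty(query, target)
    if is_positive:
        return penalty
    return max(0.0, margin - penalty)
```

With this formulation, a negative pair that already violates the ordering by more than the margin contributes no loss, so training concentrates on hard negatives.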

Abstract

The increasing size of screening libraries poses a significant challenge for the development of virtual screening methods for drug discovery, necessitating a re-evaluation of traditional approaches in the era of big data. Although 3D pharmacophore screening remains a prevalent technique, its application to very large datasets is limited by the computational cost associated with matching query pharmacophores to database molecules. In this study, we introduce PharmacoMatch, a novel contrastive learning approach based on neural subgraph matching. Our method reinterprets pharmacophore screening as an approximate subgraph matching problem and enables efficient querying of conformational databases by encoding query-target relationships in the embedding space. We conduct comprehensive investigations of the learned representations and evaluate PharmacoMatch as a pre-screening tool in a zero-shot setting. We demonstrate significantly shorter runtimes and comparable performance metrics to existing solutions, providing a promising speed-up for screening very large datasets.
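
The runtime benefit described in the abstract comes from embedding the database once and answering each query with cheap vector comparisons. The sketch below illustrates that pattern under stated assumptions: the embeddings are plain lists of floats, the comparison is an order-violation penalty, and `prescreen`, its `threshold` parameter, and the toy data are all hypothetical rather than part of the published method.

```python
from typing import Dict, List, Sequence, Tuple


def order_penalty(query: Sequence[float], target: Sequence[float]) -> float:
    """Squared violation of query <= target (component-wise); 0 means a match."""
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query, target))


def prescreen(
    query_vec: Sequence[float],
    database: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (name, penalty) for database entries whose penalty is within
    `threshold`, best candidates first.

    `database` holds embeddings that were computed once, ahead of time, so
    each query only costs one vector comparison per entry.
    """
    scored = ((name, order_penalty(query_vec, vec)) for name, vec in database.items())
    hits = [(name, penalty) for name, penalty in scored if penalty <= threshold]
    return sorted(hits, key=lambda hit: hit[1])


# Toy precomputed embeddings (made-up values for illustration only).
precomputed = {
    "mol_a": [2.0, 3.0],
    "mol_b": [0.0, 0.0],
    "mol_c": [1.0, 1.9],
}
hitlist = prescreen([1.0, 2.0], precomputed, threshold=0.05)
```

Only the entries on the resulting hitlist would then need to be passed to a slower alignment-based method, which is the pre-screening role the abstract describes.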

Paper Structure

This paper contains 51 sections, 14 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Overview of the PharmacoMatch workflow: Conformer and pharmacophore generation from ligands and query creation, for example from a ligand-protein complex, precede pharmacophore screening. The encoder model converts the screening database into embedding vectors, stored for later use. A hitlist is generated by comparing the query embedding with the database embeddings.
  • Figure 2: Illustration of the pharmacophore matching objective: The aim is to match the pharmacophoric points of a query with the corresponding points of a target pharmacophore such that the query points fall within the tolerance sphere of the target points, with a tolerance radius $r_T$.
  • Figure 3: (a) The encoder model learns an order embedding space by comparing augmented pharmacophores. (b) Illustration of the embedding space, where pharmacophores matching a query are positioned to the upper right. (c) Augmentation strategies for model training involve generating positive and negative query-target pairs on-the-fly by combining node deletion with varying degrees of node displacement. Negative pairs are also created by shuffling the batch, mapping query pharmacophores to random target pharmacophores.
  • Figure 4: (a) Dimensionality reduction of the ADA target's embedding space via PCA, with embeddings labeled by pharmacophoric feature point count. (b) Dimensionality reduction via UMAP, with embeddings labeled by pharmacophoric feature point type. (c) Experimental validation of the model's perception of 3D point positions, showing the mean matching decision function versus the displacement radius $r_D$ of the augmentation, with a decision threshold set to $t = 6500$.
  • Figure 5: Pharmacophoric feature point statistics of the training data. The respective histograms display the total number of pharmacophoric feature points and the number of points of specific types per pharmacophore in the training data. The complete training dataset contains 1,217,361 distinct pharmacophores.
  • ...and 13 more figures
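
The matching objective of Figure 2 and the augmentation strategy of Figure 3(c) can be sketched together. This is a simplified illustration, not the authors' implementation: it assumes that query and target already share a coordinate frame (no alignment step), that a query is derived from a target by deleting nodes and displacing the rest by a fixed radius $r_D$, and that a pair counts as positive when $r_D$ stays within the tolerance radius $r_T$. The feature type labels, function names, and numeric values are hypothetical.

```python
import math
import random
from typing import List, Tuple

Position = Tuple[float, float, float]
Point = Tuple[str, Position]  # (feature type, xyz position)


def displace(pos: Position, radius: float, rng: random.Random) -> Position:
    """Move a position by exactly `radius` in a uniformly random direction."""
    while True:
        # A normalised Gaussian sample gives a uniformly distributed direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in direction))
        if norm > 0.0:
            break
    x, y, z = (p + radius * c / norm for p, c in zip(pos, direction))
    return (x, y, z)


def make_query(
    target: List[Point], n_delete: int, r_d: float, rng: random.Random
) -> List[Point]:
    """Derive a query from a target: delete `n_delete` nodes, displace the rest."""
    kept = rng.sample(target, len(target) - n_delete)
    return [(ftype, displace(pos, r_d, rng)) for ftype, pos in kept]


def matches(query: List[Point], target: List[Point], r_t: float) -> bool:
    """True if every query point lies within `r_t` of a same-typed target point.

    Simplification: no alignment and no one-to-one assignment is enforced.
    """
    return all(
        any(ttype == qtype and math.dist(qpos, tpos) <= r_t for ttype, tpos in target)
        for qtype, qpos in query
    )


rng = random.Random(0)
target = [
    ("HBD", (0.0, 0.0, 0.0)),
    ("HBA", (4.0, 0.0, 0.0)),
    ("AR", (0.0, 5.0, 0.0)),
]
positive_query = make_query(target, n_delete=1, r_d=0.5, rng=rng)  # r_d <= r_t
negative_query = make_query(target, n_delete=1, r_d=3.0, rng=rng)  # r_d > r_t
```

Because the pairs are generated from the target itself, the labels come for free, which is what makes the training self-supervised.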