Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT

Andrew T. McNutt, Abhinav K. Adduri, Caleb N. Ellington, Monica T. Dayao, Eric P. Xing, Hosein Mohimani, David R. Koes

TL;DR

The paper addresses the need for scalable, accurate, and interpretable virtual screening by moving from structure-based docking to a vector-based DTI framework. It introduces SPRINT, which learns drug-target co-embeddings using structure-aware protein representations and multi-head attention pooling, enabling rapid proteome-scale searches and interpretable residue-level attention maps. SPRINT achieves state-of-the-art results on DTI classification and virtual screening benchmarks and competitive binding affinity predictions, while a vector store enables pan-species querying across billions of molecules. This work promises to accelerate in silico drug discovery and repurposing by making large-scale virtual screening widely accessible and by providing mechanistic insights through attention analyses.

Abstract

Virtual screening of small molecules against protein targets can accelerate drug discovery and development by predicting drug-target interactions (DTIs). However, structure-based methods like molecular docking are too slow to allow for broad proteome-scale screens, limiting their application in screening for off-target effects or new molecular mechanisms. Recently, vector-based methods using protein language models (PLMs) have emerged as a complementary approach that bypasses explicit 3D structure modeling. Here, we develop SPRINT, a vector-based approach for screening entire chemical libraries against whole proteomes for DTIs and novel mechanisms of action. SPRINT improves on prior work by using a self-attention based architecture and structure-aware PLMs to learn drug-target co-embeddings for binder prediction, search, and retrieval. SPRINT achieves SOTA enrichment factors in virtual screening on LIT-PCBA, DTI classification benchmarks, and binding affinity prediction benchmarks, while providing interpretability in the form of residue-level attention maps. In addition to being both accurate and interpretable, SPRINT is ultra-fast: querying the whole human proteome against the ENAMINE Real Database (6.7B drugs) for the 100 most likely binders per protein takes 16 minutes. SPRINT promises to enable virtual screening at an unprecedented scale, opening up new opportunities for in silico drug repurposing and development. SPRINT is available on the web as ColabScreen: https://bit.ly/colab-screen
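The retrieval step described in the abstract (finding the most likely binders for a protein in a shared embedding space) can be illustrated with a minimal sketch. It assumes the drug and target embeddings have already been computed and ranks drugs by cosine similarity with a brute-force scan; the paper uses a vector store for this at scale, which is not reproduced here. The function names `normalize` and `top_k_binders` and the toy 3-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_binders(target_emb, drug_embs, k):
    """Return the k (drug_id, cosine_similarity) pairs most similar to the target.

    Brute-force scan over every drug; a real vector store would replace this
    loop with an index lookup.
    """
    t = normalize(target_emb)
    scored = (
        (drug_id, sum(a * b for a, b in zip(t, normalize(emb))))
        for drug_id, emb in drug_embs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical embeddings (assumption: not taken from the paper).
library = {
    "drug_a": [1.0, 0.0, 0.0],
    "drug_b": [0.0, 1.0, 0.0],
    "drug_c": [0.9, 0.1, 0.0],
    "drug_d": [-1.0, 0.0, 0.0],
}
hits = top_k_binders([1.0, 0.0, 0.0], library, k=2)
for drug_id, score in hits:
    print(drug_id, round(score, 3))
```

Because all embeddings live in one space, the same routine could in principle be called once per protein to screen a proteome; only the size of the library and the indexing strategy change.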

Paper Structure

This paper contains 16 sections, 4 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Gnina CNN VS scores of the molecules found during DeepDocking and the molecules picked with SPRINT.
  • Figure 2: Comparing the average attention weight of binding and non-binding residues on our set of 109 single-chain protein-ligand binding structures after training on the MERGED Dataset (methodology detailed in Appendix \ref{sec:Attn_interrogation}). We visualize the ProtBert and SaProt models trained with equal positive and negative sampling. The horizontal line indicates the average across the proteins. Visualizations of the ProtBert and SaProt models trained with increased negative sampling are in Figure \ref{fig:bs_attn_fullfig}.
  • Figure 3: Analyzing the attention on PDB ID 2X4Z using ProtBert and SaProt models trained with an equal ratio of positive and negative examples (identical models trained with different initial random seeds are visualized in Figures \ref{fig:struct_attn_protbert_one_2X4Z} and \ref{fig:struct_attn_saprot_one_2X4Z}; models trained with increased negative sampling are visualized in Figures \ref{fig:struct_attn_protbert_2X4Z} and \ref{fig:struct_attn_saprot_2X4Z}). Each column is a different attention head. The gradient from white to red indicates the attention weight, where white is no attention and red is the maximum attention for that head. The ligand is shown in blue.
  • Figure 4: Times for predicting the top DTIs for a ligand using vector search.
  • Figure 5: SPRINT learns protein representations via a multi-head attention pooling scheme. Then, SPRINT learns a shared co-embedding space between molecules and protein targets via modality-specific neural networks $C_d$ and $C_t$. The model is trained end-to-end via a binary cross entropy loss on binding and non-binding drug-target pairs, where the probability of interaction is computed as a sigmoid function of the cosine distance between the drug and target embeddings. The learnable parameters of the network are depicted with dashed borders.
  • ...and 7 more figures
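
The scoring pipeline in the Figure 5 caption (attention pooling over residues, a shared co-embedding space, and a binary cross entropy loss on a sigmoid of the cosine between drug and target embeddings) can be sketched in plain Python. This is a toy under stated assumptions, not the authors' implementation: it uses a single attention head with a hypothetical learned `query` vector, omits the networks $C_d$ and $C_t$, and applies the sigmoid to the cosine similarity directly, since the excerpt does not give the exact scaling or sign convention. All function names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(residue_embs, query):
    """Pool per-residue embeddings into one vector with a single attention head.

    Returns the pooled vector and the per-residue attention weights; the
    weights are what a residue-level attention map would visualize.
    """
    scores = [sum(q * r for q, r in zip(query, emb)) for emb in residue_embs]
    weights = softmax(scores)
    dim = len(residue_embs[0])
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, residue_embs))
        for i in range(dim)
    ]
    return pooled, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def interaction_probability(drug_emb, target_emb):
    """Sigmoid of the cosine similarity (assumed form, see lead-in)."""
    return 1.0 / (1.0 + math.exp(-cosine(drug_emb, target_emb)))


def bce_loss(prob, label, eps=1e-12):
    """Binary cross entropy for one drug-target pair (label is 0 or 1)."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps))


# Hypothetical 2-dimensional residue embeddings and query vector.
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
target_emb, weights = attention_pool(residues, query)
prob = interaction_probability([1.0, 0.5], target_emb)
loss_if_binder = bce_loss(prob, 1)
loss_if_nonbinder = bce_loss(prob, 0)
```

A multi-head variant would repeat `attention_pool` with one query per head and combine the pooled vectors, which matches the per-head columns shown in Figure 3 but is left out here for brevity.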