Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking
Thomas Le Menestrel, Manuel Rivas
TL;DR
Smiles2Dock addresses the need for scalable, accessible benchmarks for ML-driven molecular docking by assembling open, large-scale data that pairs AlphaFold protein models with vast ChEMBL ligand libraries. It leverages P2Rank for binding-site prediction and AutoDock Vina to generate over 25 million docking scores, creating a diverse, SMILES-based dataset suitable for graph, transformer, and CNN-based docking methods. A novel Transformer-based baseline using ESM2 for proteins and MolFormer for ligands demonstrates the feasibility of embedding-based docking predictions, achieving a test $R^2$ up to 0.40 and RMSE around 2.89 under favorable configurations. The dataset, packaging, and baseline code are openly available (e.g., HuggingFace), enabling rapid benchmarking and methodological advances in ML-based docking with broad impact on drug discovery research.
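For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how $R^2$ and RMSE are conventionally computed for a regression task such as docking-score prediction. The function names and the example scores are illustrative assumptions and are not taken from the Smiles2Dock code or data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up docking scores for illustration only (not from the dataset).
reference = [-7.0, -8.0, -6.0, -9.0]
predicted = [-7.5, -7.5, -6.5, -8.5]

print("RMSE:", rmse(reference, predicted))
print("R^2:", r2(reference, predicted))
```

RMSE is expressed in the same units as the docking score, whereas $R^2$ is unitless and measures the fraction of score variance explained by the model.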
Abstract
Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds, and enables researchers to benchmark all major approaches for ML-based docking, such as graph-, Transformer-, and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and use it as an initial baseline for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking and to advance scientific research in this field.
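The dataset size quoted above follows from docking every ligand against every protein. The sketch below checks that arithmetic and shows the pair grid on a toy scale. The protein labels are hypothetical placeholders, the SMILES strings are generic examples rather than dataset entries, and 1.7 million is the rounded ligand count from the abstract, so the product is only an approximate figure.

```python
from itertools import product


def count_pairs(n_ligands: int, n_proteins: int) -> int:
    """Number of protein-ligand pairs when every ligand is docked against every protein."""
    return n_ligands * n_proteins


# Rounded figures from the abstract: 1.7 million ligands, 15 proteins.
print(count_pairs(1_700_000, 15))  # 25500000

# Toy pair grid with hypothetical protein labels and generic SMILES strings.
proteins = ["protein_A", "protein_B"]
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
pairs = list(product(proteins, smiles))
for protein, ligand in pairs:
    print(protein, ligand)
```

Each pair corresponds to one row in a multi-task layout, where the docking score for a given protein is the regression target for that ligand.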
