Hermes: Large DEL Datasets Train Generalizable Protein-Ligand Binding Prediction Models
Maxwell Kleinsasser, Brayden J. Halverson, Edward Kraft, Sean Francis-Lyon, Sarah E. Hugo, Mackenzie R. Roman, Ben Miller, Andrew D. Blevins, Ian K. Quigley
TL;DR
Hermes is a lightweight transformer trained exclusively on large-scale DEL screening data to learn transferable protein-ligand interaction (PLI) representations, enabling generalization to held-out targets and unseen chemistries without traditional affinity labels. Using pre-trained embeddings (ESM-2 for proteins and ChemBERTa for ligands) and a joint cross-attention mechanism, Hermes fuses protein and ligand information efficiently and supports inference fast enough for virtual screening. Across diverse benchmarks, Hermes generalizes to external datasets and different assay systems, though performance varies with target space and data quality; an ensemble of checkpoints improves stability. The results highlight the value of DEL data for learning transferable PLI representations and show substantial speed advantages over structure-based models, suggesting that DEL-trained models can drive scalable, early-stage drug discovery. Limitations remain from label noise and memorization tendencies.
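To make the fusion step concrete, the following is a minimal sketch of bidirectional cross-attention between protein and ligand token embeddings, followed by averaging over several checkpoints. It is an illustration under stated assumptions, not the Hermes implementation: the random vectors stand in for ESM-2 and ChemBERTa embeddings, the learned query/key/value projections are omitted, "joint cross-attention" is interpreted here as each modality attending to the other, and every function name, dimension, and weight is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query token attends over the
    tokens of the other modality. Learned projections are omitted
    (assumption made for brevity)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def mean_pool(tokens):
    """Average token vectors into a single fixed-size vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def binding_score(protein_tokens, ligand_tokens, head_weights):
    """Fuse both modalities and map the result to a score in (0, 1)."""
    lig_ctx = cross_attention(ligand_tokens, protein_tokens)   # ligand -> protein
    prot_ctx = cross_attention(protein_tokens, ligand_tokens)  # protein -> ligand
    fused = mean_pool(lig_ctx) + mean_pool(prot_ctx)           # list concatenation
    logit = sum(w * x for w, x in zip(head_weights, fused))
    return 1.0 / (1.0 + math.exp(-logit))


def ensemble_score(protein_tokens, ligand_tokens, checkpoints):
    """Average the scores of several checkpoints. Here each hypothetical
    checkpoint differs only in its linear head, a simplification."""
    scores = [binding_score(protein_tokens, ligand_tokens, w)
              for w in checkpoints]
    return sum(scores) / len(scores)


rng = random.Random(0)
dim = 8  # toy embedding width; real encoder widths are much larger
protein_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
ligand_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
checkpoints = [[rng.gauss(0, 0.5) for _ in range(2 * dim)] for _ in range(4)]

score = ensemble_score(protein_tokens, ligand_tokens, checkpoints)
print(f"ensemble binding score: {score:.3f}")
```

Because this sketch needs only precomputed embeddings and a few attention operations per protein-ligand pair, it also hints at why such a design can be much cheaper at inference time than structure-based approaches that require docking or pose generation.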
Abstract
The quality and consistency of training data remain critical bottlenecks for protein-ligand binding prediction. Public affinity datasets, aggregated from thousands of labs and assay formats, introduce biases that limit model generalization and complicate evaluation. DNA-encoded chemical libraries (DELs) offer a potential solution: unified experimental protocols generating massive binding datasets across diverse chemical and protein target space. We present Hermes, a lightweight transformer trained exclusively on DEL data from screens against hundreds of protein targets, representing one of the largest and most protein-diverse DEL training sets applied to protein-ligand interaction (PLI) modeling to date. Despite never seeing traditional affinity measurements during training, Hermes generalizes to held-out targets, novel chemical scaffolds, and external benchmarks derived from public binding data and high-throughput screens. Our results demonstrate that DEL data alone captures transferable protein-ligand interaction representations, while Hermes' minimal architecture enables inference speeds suitable for large-scale virtual screening.
