How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
Philip Fradkin, Puria Azadi, Karush Suri, Frederik Wenkel, Ali Bashashati, Maciej Sypetkowski, Dominique Beaini
TL;DR
This work tackles the prediction of how molecular perturbations affect cellular morphology by learning a joint latent space between phenomic experiments and molecular structures via contrastive learning. It introduces MolPhenix, a framework that combines a pretrained phenomic encoder, a pretrained molecular encoder, a novel inter-sample similarity-aware loss (S2L), and conditioning on molecular concentration, in order to mitigate batch effects and inactive perturbations while capturing dose-dependent effects. The approach achieves large gains in zero-shot retrieval of active molecules (an 8.1x improvement over the prior state-of-the-art, reaching 77.33% top-1% accuracy) and performs robustly across cumulative and held-out concentrations, aided by explicit and implicit concentration encoding and embedding averaging. These results suggest significant potential for virtual phenomics screening to accelerate drug discovery, with broader implications for multi-modal medical ML and biologically grounded representation learning.
Abstract
Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, use microscopy-based techniques and offer a high-throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities, such as experimental batch effects, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved retrieval for multi-modal learners through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity-aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.
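
To make the retrieval setup concrete, the following is a minimal sketch of contrastive alignment between paired embeddings and of a top-k% retrieval metric. It assumes a generic CLIP-style symmetric InfoNCE objective as a stand-in; it is not the inter-sample similarity-aware loss proposed in the paper, and it omits concentration conditioning entirely. The function names, the temperature value, and the tie-handling rule in the metric are illustrative assumptions, not details taken from the work.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_infonce(pheno_emb, mol_emb, temperature=0.1):
    """Generic CLIP-style loss (assumed stand-in, not the paper's loss).

    Row i of pheno_emb and row i of mol_emb are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    p = l2_normalize(pheno_emb)
    m = l2_normalize(mol_emb)
    logits = p @ m.T / temperature
    idx = np.arange(len(p))
    # Phenomics -> molecule direction: softmax over molecules (columns).
    loss_p2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Molecule -> phenomics direction: softmax over experiments (rows).
    loss_m2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2m + loss_m2p)


def top_percent_accuracy(pheno_emb, mol_emb, percent=1.0):
    """Fraction of phenomic queries whose paired molecule ranks in the top
    `percent` of all candidate molecules by cosine similarity."""
    p = l2_normalize(pheno_emb)
    m = l2_normalize(mol_emb)
    sims = p @ m.T
    k = max(1, int(np.ceil(m.shape[0] * percent / 100.0)))
    true_sim = np.diag(sims)[:, None]
    # Rank = number of candidates scoring strictly higher than the true pair.
    rank = (sims > true_sim).sum(axis=1)
    return float((rank < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pheno = rng.normal(size=(200, 32))                 # hypothetical phenomic embeddings
    mol = pheno + 0.01 * rng.normal(size=(200, 32))    # nearly aligned molecular embeddings
    print("loss:", symmetric_infonce(pheno, mol))
    print("top-1% accuracy:", top_percent_accuracy(pheno, mol, percent=1.0))
```

In this sketch, retrieval is zero-shot in the sense that the metric only ranks candidate molecules by similarity in the shared space; no classifier is trained on the candidate set.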
