How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval

Philip Fradkin, Puria Azadi, Karush Suri, Frederik Wenkel, Ali Bashashati, Maciej Sypetkowski, Dominique Beaini

TL;DR

This work tackles predicting molecular perturbations' effects on cellular morphology by learning a joint latent space between phenomic experiments and molecular structures via contrastive learning. It introduces MolPhenix, a framework that leverages a pretrained phenomic encoder, a pretrained molecular encoder, and a novel inter-sample, concentration-aware S2L loss to mitigate batch effects and inactive perturbations while capturing dose-dependent effects. The approach achieves large gains in zero-shot active molecule retrieval (up to 8.1x over prior SOTA, reaching 77.33% top-1% for active molecules) and demonstrates robust performance across cumulative and held-out concentrations, aided by explicit and implicit concentration encoding and embedding averaging. These results suggest significant potential for virtual phenomics screening to accelerate drug discovery, with broad implications for multi-modal medical ML and biologically grounded representation learning.

Abstract

Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, utilize microscopy-based techniques and demonstrate a high-throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal learner retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity-aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.
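
To make "aligning paired samples with contrastive learning" concrete, the following is a minimal sketch of a generic symmetric InfoNCE (CLIP-style) objective over a batch of paired phenomic and molecular embeddings. It is not the paper's S2L loss, which additionally weights inter-sample similarity and accounts for concentration. The function names, the temperature value, and the use of cosine similarity are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(pheno_emb, mol_emb, temperature=0.1):
    """Generic CLIP-style loss; row i of each input is assumed to be a true pair.

    pheno_emb, mol_emb: arrays of shape (batch, d).
    temperature: hypothetical value, not taken from the paper.
    """
    z_x = l2_normalize(pheno_emb)
    z_m = l2_normalize(mol_emb)
    logits = z_x @ z_m.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Phenomics -> molecule direction: each row should pick its own column.
    loss_x2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Molecule -> phenomics direction: each column should pick its own row.
    loss_m2x = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_x2m + loss_m2x)
```

Under this objective, a batch whose paired rows are close in the latent space yields a lower loss than the same batch with the pairing scrambled, which is the property a joint phenomics-molecule space relies on.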

Paper Structure (26 sections, 5 equations, 10 figures, 22 tables)

Figures (10)

  • Figure 1: Illustration of proposed guidelines when incorporated in our MolPhenix contrastive phenomolecular retrieval framework. We address challenges by utilizing uni-modal pretrained MAE & MPNN models, inter-sample weighting with a dosage-aware S2L loss, undersampling inactive molecules, and encoding molecular concentration.
  • Figure 2: Illustration of the contrastive phenomolecular retrieval challenge. Image $\mathbf{x}_i$ and a set of molecules and corresponding concentrations $(\mathbf{m}_k, \mathbf{c}_k)$ get mapped into a $\mathbb{R}^d$ latent space. Their similarities get computed with $f_{sim}$ and ranked to evaluate whether the paired perturbation appears in the top K%.
  • Figure 3: Data generation process of a phenomic experiment on cells $\mathbf{x_i}$ with molecular perturbations $\mathbf{m_i}$ and concentrations $\mathbf{c_i}$.
  • Figure 4: Comparison of training the phenomic encoder from scratch versus utilizing pre-trained Phenom1 on the unseen dataset. X-axis is plotted on a logarithmic scale.
  • Figure 5: Ablations of top-1% recall accuracy over (bottom-left) cutoff $p$ value, (bottom-center) fingerprint type, and (bottom-right) embedding averaging.
  • ...and 5 more figures
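
The retrieval evaluation described in the Figure 2 caption (embed an image and a set of candidate molecules, score them with $f_{sim}$, rank, and check whether the paired perturbation lands in the top K%) can be sketched as follows. This is an illustrative sketch rather than the authors' evaluation code: cosine similarity stands in for $f_{sim}$, query `i` is assumed to be paired with candidate `i`, and the function name `top_k_percent_recall` is hypothetical.

```python
import math
import numpy as np

def top_k_percent_recall(query_emb, cand_emb, k_percent=1.0):
    """Fraction of queries whose paired candidate ranks within the top K%.

    query_emb: (n, d) phenomic embeddings.
    cand_emb:  (n, d) molecule (and concentration) embeddings; row i pairs with query i.
    """
    # Cosine similarity is an assumed choice for f_sim.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = q @ c.T                                   # (n, n) similarity matrix
    n_cand = sim.shape[1]
    # Number of candidates that count as "top K%", at least one.
    cutoff = max(1, math.ceil(n_cand * k_percent / 100.0))
    paired = np.diag(sim)[:, None]                  # similarity to the true pair
    # Rank = number of candidates scored strictly higher than the true pair
    # (0 is best; ties are resolved in favor of the true pair).
    rank = (sim > paired).sum(axis=1)
    return float((rank < cutoff).mean())

# Toy check: four candidates, two queries matched correctly and two swapped.
cands = np.eye(4)
queries = np.eye(4)[[1, 0, 2, 3]]
print(top_k_percent_recall(queries, cands, k_percent=25))  # 0.5
```

With four candidates, K = 25% admits only the single best-scoring candidate, so the two swapped queries miss; raising K to 50% admits two candidates and all four queries are counted as hits.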