Learning Molecular Representation in a Cell

Gang Liu, Srijit Seal, John Arevalo, Zhenwen Liang, Anne E. Carpenter, Meng Jiang, Shantanu Singh

TL;DR

Information Alignment (InfoAlign) learns molecular representations in cells through the information bottleneck method. Its sufficiency objective for alignment is shown to be tighter than existing encoder-based contrastive methods.

Abstract

Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder's latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule's neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching.
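The two objectives named in the abstract can be illustrated with a generic information-bottleneck loss. The following is a minimal sketch, not the paper's implementation: it assumes a Gaussian encoder with a standard-normal prior for the minimality term, a squared-error decoding loss for the sufficiency term, and edge weights from the context graph as per-target weights. The function names, argument shapes, and the default value of `beta` are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def info_bottleneck_loss(mu, logvar, decoded, targets, edge_weights, beta=1e-3):
    """Hypothetical IB-style loss for a batch of B molecules.

    mu, logvar   : (B, D) parameters of the encoder's latent distribution
    decoded      : (B, N, F) decoder predictions for N neighbours per molecule
    targets      : (B, N, F) neighbour features (assumed to share one size F)
    edge_weights : (B, N) context-graph edge weights (assumed weighting scheme)
    beta         : strength of the prior (minimality) term
    """
    # Minimality: penalise information kept about the input structure.
    minimality = kl_to_standard_normal(mu, logvar)                # (B,)
    # Sufficiency: the representation must decode to its neighbours' features.
    per_target_err = np.mean((decoded - targets) ** 2, axis=-1)   # (B, N)
    sufficiency = np.sum(edge_weights * per_target_err, axis=-1)  # (B,)
    return float(np.mean(sufficiency + beta * minimality))
```

With a latent that matches the prior and perfect decoding, both terms vanish and the loss is zero; increasing `beta` trades decoding accuracy for a more compressed representation.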

Paper Structure (29 sections, 1 theorem, 14 equations, 7 figures, 5 tables)

Key Result

Proposition 4.1

For the molecular representation $Z$ and target $Y$ (from cell morphology, gene expressions, or molecular fingerprints), the encoder-based MI lower bound $I_{ELB}$ for InfoNCE can be derived by incorporating $K-1$ additional samples, denoted as $y_{2:K}$, to build the Monte Carlo estimate $m(\cdot)$ where $h(z, y)$ is the neural network parameterized critic for density approximation with the energ
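For orientation, the widely used form of the InfoNCE estimate can be computed as below. This is a sketch of the standard bound only, since the proposition text above is cut off and its exact statement may differ. It assumes a precomputed matrix of critic scores in which entry `(i, j)` holds $h(z_i, y_j)$ and the diagonal holds the positive pairs; the function name is hypothetical.

```python
import numpy as np

def infonce_lower_bound(scores):
    """Standard InfoNCE estimate from a (K, K) matrix of critic scores.

    scores[i, j] = h(z_i, y_j); the diagonal holds the positive pairs and the
    other K - 1 entries in each row act as the additional samples.
    """
    scores = np.asarray(scores, dtype=float)
    K = scores.shape[0]
    # Numerically stable log-sum-exp over each row.
    row_max = scores.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # mean_i log( exp(h(z_i, y_i)) / ((1/K) * sum_j exp(h(z_i, y_j))) )
    return float(np.mean(np.diag(scores) - lse + np.log(K)))
```

An uninformative critic (all scores equal) gives an estimate of zero, and the estimate can never exceed $\log K$, which is one reason the tightness of encoder-based bounds matters.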

Figures (7)

  • Figure 1: Comparison of Representation Learning Methods: (a) Existing contrastive methods use two encoders (one for molecules and another for cell morphology or gene expression features) and do not share the molecule encoder across alignment targets. (b) InfoAlign removes redundant information from molecules, cell morphology, and gene expression based on the information bottleneck [alemi2016deep], resulting in more concise yet predictive molecular representations.
  • Figure 2: Molecular Representation Learning Using the Context Graph: (a) In \ref{subsec:context-graph-walk}, we construct the graph from interaction, perturbation, and cosine-similarity relationships among molecules $x$, cell morphology profiles $c$, and genes $e$. Given a training batch of molecules, including $x_1$ and $x_4$, a random walk extracts paths, for instance, of length four. (b) In \ref{subsec:represent-learn}, we aim to learn molecular representations based on the information bottleneck, preserving minimal information from the input molecule while ensuring sufficient information for decoding the targets along the walk path $\mathcal{P}_x$.
  • Figure 3: Percentage of Tasks Where Representations Excel: We compare the relative performance of three single-representation (Single Rep.) approaches (molecular structure, cell morphology, and gene expression) and three aligned representations (Aligned Rep.): InfoAlign, CLOOME, and InfoCORE.
  • Figure 4: Analysis of the hyperparameters: the strength of the prior $\beta$ and the random walk length $L$. AUC is computed on the test set of ChEMBL2K.
  • Figure 5: From the initial idea in \ref{sec:method} to the practical implementation of the context graph: we first display relations between molecules and all the landmark genes from [wang2016drug] for the $X_1 - E_3$ and $X_3 - E_2$ relationships. $E_3$ and $E_2$ are landmark genes involved in small-molecule perturbations and cell morphology perturbations; we display them separately for clarity. Next, we merge all landmark genes into new gene expression nodes and integrate genes from genetic perturbations in the JUMP dataset [chandrasekaran2023jump] with cell morphology features. Practical considerations are detailed in \ref{sec:pretrain-setup} and \ref{sec:more-context-graph}.
  • ...and 2 more figures
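
The Figure 2 caption describes extracting paths from the context graph by random walk. A minimal sketch of a weighted random walk follows, assuming the graph is stored as a dictionary of neighbour-to-weight mappings and that each step samples a neighbour with probability proportional to its edge weight. The node names, weights, sampling rule, and the reading of "length" as a number of steps are illustrative assumptions, not details from the paper.

```python
import random

def random_walk(graph, start, length, rng=None):
    """Walk up to `length` steps from `start`, sampling neighbours by edge weight.

    graph: dict mapping node -> {neighbour: positive edge weight}
    Returns the list of visited nodes, beginning with `start`.
    """
    rng = rng or random.Random()
    path = [start]
    node = start
    for _ in range(length):
        neighbours = graph.get(node)
        if not neighbours:
            break  # dead end: stop the walk early
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        node = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(node)
    return path

# Hypothetical toy context graph: molecules (x), a morphology profile (c), a gene (e).
toy_graph = {
    "x1": {"c1": 0.9, "x4": 0.4},
    "x4": {"x1": 0.4, "e2": 0.7},
    "c1": {"x1": 0.9, "e2": 0.5},
    "e2": {"x4": 0.7, "c1": 0.5},
}
path = random_walk(toy_graph, "x1", 4, random.Random(0))
```

The nodes on the returned path would then serve as decoding targets for the starting molecule; passing a seeded `random.Random` keeps the walk reproducible.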

Theorems & Definitions (1)

  • Proposition 4.1