Junction Tree Variational Autoencoder for Molecular Graph Generation

Wengong Jin, Regina Barzilay, Tommi Jaakkola

TL;DR

The paper addresses automated molecular design by learning continuous embeddings and directly generating molecular graphs. It introduces the Junction Tree Variational Autoencoder (JT-VAE), which first builds a junction-tree scaffold of valid subgraphs and then assembles them into a full molecule, enforcing chemical validity throughout generation. The model jointly learns a two-part latent space, $\mathbf{z}=[\mathbf{z}_\mathcal{T}, \mathbf{z}_G]$, corresponding to the tree structure and the fine-grained graph, encoded via a tree and graph encoder and decoded through a tree and graph decoder. Empirically, JT-VAE outperforms SMILES-based baselines on generation and optimization tasks, achieving $100\%$ prior validity, strong molecule reconstruction, and superior results in Bayesian optimization and constrained optimization, highlighting its practical impact for scalable, valid molecular graph generation.

Abstract

We seek to automate the design of molecules based on specific chemical properties. In computational terms, this task involves continuous embedding and generation of molecular graphs. Our primary contribution is the direct realization of molecular graphs, a task previously approached by generating linear SMILES strings instead of graphs. Our junction tree variational autoencoder generates molecular graphs in two phases, by first generating a tree-structured scaffold over chemical substructures, and then combining them into a molecule with a graph message passing network. This approach allows us to incrementally expand molecules while maintaining chemical validity at every step. We evaluate our model on multiple tasks ranging from molecular generation to optimization. Across these tasks, our model outperforms previous state-of-the-art baselines by a significant margin.

Paper Structure

This paper contains 17 sections, 13 equations, 14 figures, 4 tables, 2 algorithms.

Figures (14)

  • Figure 1: Two almost identical molecules with markedly different canonical SMILES in RDKit. The edit distance between the two strings is 22 (50.5% of the whole sequence).
  • Figure 2: Comparison of two graph generation schemes: Structure by structure approach is preferred as it avoids invalid intermediate states (marked in red) encountered in node by node approach.
  • Figure 3: Overview of our method: A molecular graph $G$ is first decomposed into its junction tree $\mathcal{T}_G$, where each colored node in the tree represents a substructure in the molecule. We then encode both the tree and graph into their latent embeddings $\mathbf{z}_\mathcal{T}$ and $\mathbf{z}_G$. To decode the molecule, we first reconstruct junction tree from $\mathbf{z}_\mathcal{T}$, and then assemble nodes in the tree back to the original molecule.
  • Figure 4: Illustration of the tree decoding process. Nodes are labeled in the order in which they are generated. 1) Node 2 expands child node 4 and predicts its label with message $\mathbf{h}_{24}$. 2) As node 4 is a leaf node, decoder backtracks and computes message $\mathbf{h}_{42}$. 3) Decoder continues to backtrack as node 2 has no more children. 4) Node 1 expands node 5 and predicts its label.
  • Figure 5: Decode a molecule from a junction tree. 1) Ground truth molecule $G$. 2) Predicted junction tree $\widehat{\mathcal{T}}$. 3) We enumerate different combinations between red cluster $C$ and its neighbors. Crossed arrows indicate combinations that lead to chemically infeasible molecules. Note that if we discard tree structure during enumeration (i.e., ignoring subtree A), the last two candidates will collapse into the same molecule. 4) Rank subgraphs at each node. The final graph is decoded by putting together all the predicted subgraphs (dashed box).
  • ...and 9 more figures
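
The expand-and-backtrack order described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' decoder: it only records the order in which directed messages $(i, j)$ would be visited by a depth-first walk, with no neural network and no label prediction. The tree shape used here (node 1 as root with children 2 and 5, node 2 with children 3 and 4) is an assumption chosen to be consistent with the caption, and the function name `traversal_messages` is hypothetical.

```python
from typing import Dict, List, Tuple


def traversal_messages(children: Dict[int, List[int]], root: int) -> List[Tuple[int, int]]:
    """Return the directed edges (i, j) in the order a depth-first
    expand-then-backtrack walk over a rooted tree visits them."""
    messages: List[Tuple[int, int]] = []

    def visit(node: int) -> None:
        for child in children.get(node, []):
            messages.append((node, child))  # expand: parent creates a child
            visit(child)                    # fully decode the child's subtree
            messages.append((child, node))  # backtrack: child reports to parent

    visit(root)
    return messages


# Hypothetical tree consistent with the Figure 4 caption (an assumption,
# not data taken from the paper): nodes are numbered in generation order.
children = {1: [2, 5], 2: [3, 4]}
msgs = traversal_messages(children, root=1)

for i, j in msgs:
    print(f"h_{i}{j}")
```

In this sketch each tree edge contributes exactly two messages, one on expansion and one on backtracking, so a tree with $n$ nodes yields $2(n-1)$ entries. The pairs `(2, 4)`, `(4, 2)`, `(2, 1)` and `(1, 5)` appear consecutively, matching steps 1 to 4 of the caption.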