Learning to Extend Molecular Scaffolds with Structural Motifs

Krzysztof Maziarz, Henry Jackson-Flux, Pashmina Cameron, Finton Sirockin, Nadine Schneider, Nikolaus Stiefl, Marwin Segler, Marc Brockschmidt

TL;DR

MoLeR introduces a graph-based molecular generator that seamlessly extends partial molecules by combining motif-based fragments with atom-by-atom steps, enabling scaffold-constrained generation without conditioning on the full generation history. It uses a data-derived motif vocabulary $\mathcal{M}$ and a latent code $z$ within a variational autoencoder framework, with a loss $\mathcal{L} = \lambda_{prior} \mathcal{L}_{prior} + \mathcal{L}_{rec} + \lambda_{prop} \mathcal{L}_{prop}$ to train an encoder and a decoder that operates on a single, shared graph representation $h_{mol}$. Empirically, MoLeR matches state-of-the-art performance in unconstrained molecular generation and outperforms baselines on scaffold-constrained tasks, while offering an order-of-magnitude speed advantage in training and sampling. The approach also enables scaffold-aware optimization by pairing with Molecular Swarm Optimization (MSO) and provides insights into the effects of generation order and motif vocabulary size. Overall, MoLeR advances scalable, scaffold-respecting molecular design with efficient inference and robust latent-space behavior, contributing a versatile tool for drug discovery workflows that require fixed scaffolds.
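
The weighted loss named in the TL;DR can be illustrated with a minimal sketch. It assumes the prior term is the usual VAE KL divergence between a diagonal Gaussian posterior and a standard normal prior, which the summary does not spell out; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for a single latent vector.

    Assumption: a standard Gaussian prior, as in a typical VAE.
    """
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def total_loss(rec_loss, prior_loss, prop_loss, lambda_prior, lambda_prop):
    """Weighted sum L = lambda_prior * L_prior + L_rec + lambda_prop * L_prop."""
    return lambda_prior * prior_loss + rec_loss + lambda_prop * prop_loss


# Hypothetical values: a 2-dimensional latent code and made-up loss terms.
prior = kl_to_standard_normal(mu=[0.5, -0.5], log_var=[0.0, 0.0])
loss = total_loss(
    rec_loss=2.0, prior_loss=prior, prop_loss=1.0,
    lambda_prior=0.1, lambda_prop=0.5,
)
print(f"L_prior = {prior:.3f}, L = {loss:.3f}")
```

The two weights trade off latent-space regularity ($\lambda_{prior}$) and property prediction ($\lambda_{prop}$) against reconstruction, which enters with unit weight.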

Abstract

Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as the initial seed of the generative procedure, which is possible because it is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks, and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on the overall performance.

Paper Structure

This paper contains 29 sections, 4 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: Overview of our approach. We discover motifs from data (a) and use them to decompose an input molecule (b) into motifs and single atoms. In the encoder (c), atom features (bottom) are combined with motif embeddings (top), making the motif information available at the atom level. Decoder steps (d) are only conditioned on the encoder output and partial graph (hence independent) and have to select one of the valid options (shown below, correct choices marked in red).
  • Figure 2: Fréchet ChemNet Distance (lower is better) for different generation orders and vocabulary sizes. We consider generation from scratch (left), and generation starting from a scaffold (right).
  • Figure 2: Results on 20 GuacaMol tasks (left) and 4 additional scaffold-based tasks (right). The first five rows correspond to baselines from Brown et al. (2019). We do not compute quality if fewer than 100 molecules per benchmark were found.
  • Figure 3: Scaffold from a GuacaMol benchmark (top) and a scaffold from our additional benchmark (bottom).
  • Figure 4: Comparison on tasks from Lim et al. (2019). We show single-property optimization tasks as well as one where all properties must be optimized simultaneously. We plot averages and standard error over 20 runs for each task; each run uses a different scaffold and property targets.
  • ...and 10 more figures