Learning to Extend Molecular Scaffolds with Structural Motifs
Krzysztof Maziarz, Henry Jackson-Flux, Pashmina Cameron, Finton Sirockin, Nadine Schneider, Nikolaus Stiefl, Marwin Segler, Marc Brockschmidt
TL;DR
MoLeR is a graph-based molecular generator that extends partial molecules by combining motif-based fragments with atom-by-atom steps. Because its decoder does not condition on the generation history, any partial molecule, including a fixed scaffold, can serve as the starting point for generation. The model uses a data-derived motif vocabulary $\mathcal{M}$ and a latent code $z$ within a variational autoencoder framework. The encoder and decoder are trained with the loss $\mathcal{L} = \lambda_{prior} \mathcal{L}_{prior} + \mathcal{L}_{rec} + \lambda_{prop} \mathcal{L}_{prop}$, and the decoder operates on a single, shared graph representation $h_{mol}$. Empirically, MoLeR performs comparably to state-of-the-art methods on unconstrained tasks and outperforms baselines on scaffold-constrained tasks, while being an order of magnitude faster to train and sample from. The approach also enables scaffold-aware optimization when paired with Molecular Swarm Optimization (MSO), and the paper analyzes the effects of generation order and motif vocabulary size. Overall, MoLeR offers an efficient tool for drug discovery workflows that require a fixed scaffold.
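To make the weighted loss above concrete, the following is a minimal sketch of how the three terms might be combined. It assumes that the prior term is the usual VAE KL divergence between a diagonal Gaussian posterior and a standard normal prior; the function names and the default weights are hypothetical placeholders, not values taken from the paper.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single latent code.

    Assumption: the prior term is the standard Gaussian KL used in VAEs.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def total_loss(l_rec, l_prior, l_prop, lambda_prior=0.01, lambda_prop=1.0):
    """Weighted sum L = lambda_prior * L_prior + L_rec + lambda_prop * L_prop.

    The default weights are illustrative placeholders only.
    """
    return lambda_prior * l_prior + l_rec + lambda_prop * l_prop
```

In this sketch, `l_rec` and `l_prop` would come from the decoder's reconstruction objective and a property-prediction head, respectively; they are passed in as plain numbers to keep the example self-contained.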
Abstract
Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as the initial seed of the generative procedure, which is possible because the model is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks, and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on the overall performance.
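The link between history-independence and scaffold seeding can be illustrated with a toy decoding loop. This is a sketch under strong simplifications, not the MoLeR decoder: the "molecule" is a flat list of atom symbols rather than a graph, and `decoder_step`, `generate`, and the fields of `z` are all hypothetical. The only point is that each step sees just the current partial state and the latent code, so generation can start from any partial state.

```python
ATOMS = ["C", "N", "O"]


def decoder_step(partial, z):
    """Toy stand-in for one decoder step.

    It looks only at the current partial molecule and the latent code z,
    never at the sequence of steps that produced `partial`.
    """
    if len(partial) >= z["max_atoms"]:
        return None  # signal that generation should stop
    return ATOMS[(len(partial) + z["offset"]) % len(ATOMS)]


def generate(z, seed=()):
    """Extend `seed` (empty by default) until the decoder signals a stop."""
    partial = list(seed)
    while (atom := decoder_step(partial, z)) is not None:
        partial.append(atom)
    return partial


z = {"max_atoms": 5, "offset": 0}  # hypothetical latent code
from_scratch = generate(z)
scaffold = ["O", "O"]  # hypothetical fixed scaffold
from_scaffold = generate(z, seed=scaffold)
```

Because `decoder_step` never inspects how `partial` was built, the same loop handles both cases, and the scaffold is kept unchanged as the start of `from_scaffold`. A decoder that consumed its own generation history would instead need some valid history for the scaffold before it could continue from it.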
