Procedural Synthesis of Synthesizable Molecules

Michael Sun, Alston Lo, Minghao Guo, Jie Chen, Connor Coley, Wojciech Matusik

TL;DR

This work recasts synthesizable molecular design and analog generation as conditional program-synthesis problems and introduces a bilevel optimization framework that decouples syntactic skeletons from chemical semantics. An outer Metropolis-Hastings search over skeletons and an inner horizon-aware decoding loop, implemented with graph neural policies, enable efficient exploration of synthesis pathways within a fixed grammar; a genetic algorithm (GA) complements them for multi-objective design. Across analog generation and molecule design tasks, the approach yields higher reconstruction accuracy, greater diversity, and improved synthetic accessibility, with strong docking performance and notable sample-efficiency gains. The results indicate that explicit control over synthesis resources and templates can accelerate discovery workflows and that the approach is well suited to integration with autonomous synthesis platforms.

Abstract

Designing synthetically accessible molecules and recommending analogs to unsynthesizable molecules are important problems for accelerating molecular discovery. We reconceptualize both problems using ideas from program synthesis. Drawing inspiration from syntax-guided synthesis approaches, we decouple the syntactic skeleton from the semantics of a synthetic tree to create a bilevel framework for reasoning about the combinatorial space of synthesis pathways. Given a molecule we aim to generate analogs for, we iteratively refine its skeletal characteristics via Markov Chain Monte Carlo simulations over the space of syntactic skeletons. Given a black-box oracle to optimize, we formulate a joint design space over syntactic templates and molecular descriptors and introduce evolutionary algorithms that optimize both syntactic and semantic dimensions synergistically. Our key insight is that once the syntactic skeleton is set, we can amortize over the search complexity of deriving the program's semantics by training policies to fully utilize the fixed horizon Markov Decision Process imposed by the syntactic template. We demonstrate performance advantages of our bilevel framework for synthesizable analog generation and synthesizable molecule design. Notably, our approach offers the user explicit control over the resources required to perform synthesis and biases the design space towards simpler solutions, making it particularly promising for autonomous synthesis platforms. Code is at https://github.com/shiningsunnyday/SynthesisNet.
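
As a rough illustration of the outer loop described in the abstract, the following is a minimal Metropolis-Hastings sketch over a toy state space. It is not the authors' implementation: integer states stand in for syntactic skeletons, `path_neighbors` and `toy_dist` are hypothetical stand-ins for the skeleton edit moves and the distance to the target molecule, and the smoothing constant `eps` is an assumption added so the inverse-distance weight stays finite at distance zero. The target distribution follows the caption of Figure 2, which describes a stationary distribution proportional to the inverse distance to the target.

```python
import random


def path_neighbors(state, n=7):
    """Hypothetical move set: adjacent states on a path of n toy 'skeletons'."""
    return [s for s in (state - 1, state + 1) if 0 <= s < n]


def toy_dist(state, target=3):
    """Hypothetical distance from the decoded analog to the target molecule."""
    return abs(state - target)


def metropolis_hastings(start, neighbors, dist, steps, rng, eps=1.0):
    """Visit states with frequency roughly proportional to 1 / (dist + eps)."""

    def weight(state):
        # eps is an assumed smoothing term, not something stated in the paper.
        return 1.0 / (dist(state) + eps)

    current = start
    visits = {}
    for _ in range(steps):
        options = neighbors(current)
        proposal = rng.choice(options)
        # Hastings correction: states may have different numbers of neighbors.
        q_forward = 1.0 / len(options)
        q_backward = 1.0 / len(neighbors(proposal))
        ratio = (weight(proposal) * q_backward) / (weight(current) * q_forward)
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] = visits.get(current, 0) + 1
    return visits


if __name__ == "__main__":
    counts = metropolis_hastings(0, path_neighbors, toy_dist, 20000, random.Random(0))
    for state in sorted(counts):
        print(state, counts[state])
```

In this toy setting the chain spends most of its time at the state closest to the target, which mirrors the intent of refining the skeleton toward better analogs. In the paper's setting each distance evaluation would presumably involve decoding a program for the proposed skeleton, which the sketch does not model.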

Paper Structure

This paper contains 52 sections, 4 equations, 18 figures, 11 tables, and 1 algorithm.

Figures (18)

  • Figure 1: Program synthesis terminology for modeling synthesis pathways.
  • Figure 2: (Left) Our Metropolis-Hastings algorithm in Section \ref{sec:3-3} iteratively refines the syntax tree skeleton towards the stationary distribution, which is proportional to the inverse distance to our target molecule $M$. (Right) Our genetic algorithm over the joint design space ${\mathcal{X}}\times{\mathcal{T}}$ in Section \ref{sec:3-4} combines the strategies of semantic crossover ($\rightarrow$) and syntactical mutation ($\textcolor{red}{\rightarrow}$) to encourage both global improvement and local exploration.
  • Figure 3: Illustration of our decoding scheme $F$: (Left) The input is a Morgan fingerprint ${\bm{x}}$ and syntax skeleton $T$; (Middle) Decode once for every topological ordering of the tree, tracking all partial programs with a stack; (Right) Execute all decoded programs, then return the closest analog, i.e., the one that minimizes the distance to ${\bm{x}}$.
  • Figure 4: We adopt the tree edit distance as the $\text{dist}$ function. We see that $\hat{\mathcal{T}}_4$ has sufficient transition coverage for bootstrapping our space of syntactic templates.
  • Figure 5: (a) Summary statistics of the number of syntactic templates (both empirical and theoretically possible) and possible topological decoding node orders for $k=1,2,\ldots,5$; (b) Summary statistics for only the number of syntactic templates since enumerating all topological sorts becomes intractable; (c) Summary statistics for the number of topological masks (subsets of nodes closed under $\text{parent}(\cdot)$).
  • ...and 13 more figures
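
The caption of Figure 3 mentions decoding once for every topological ordering of the skeleton tree. A minimal sketch of how such orderings could be enumerated is given below. It assumes a rooted tree stored as a hypothetical `children` dictionary and assumes that each parent is decoded before its children, which is consistent with the "closed under parent" wording in the caption of Figure 5 but is not confirmed here as the authors' exact convention.

```python
from typing import Dict, Iterator, List


def topological_orders(children: Dict[str, List[str]], root: str) -> Iterator[List[str]]:
    """Yield every node order in which each parent precedes its children."""

    def extend(prefix: List[str], frontier: List[str]) -> Iterator[List[str]]:
        if not frontier:
            # Every node has been placed, so the prefix is a complete order.
            yield list(prefix)
            return
        for i, node in enumerate(frontier):
            # Placing a node makes its children available for later positions.
            rest = frontier[:i] + frontier[i + 1:] + children.get(node, [])
            prefix.append(node)
            yield from extend(prefix, rest)
            prefix.pop()

    yield from extend([], [root])


if __name__ == "__main__":
    # Hypothetical four-node skeleton: r has children a and b; a has child c.
    tree = {"r": ["a", "b"], "a": ["c"]}
    for order in topological_orders(tree, "r"):
        print(order)
```

The number of orderings grows quickly with tree size, which fits the remark in the caption of Figure 5 that enumerating all topological sorts becomes intractable for larger skeletons.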