Efficient and Programmable Exploration of Synthesizable Chemical Space

Shitong Luo, Connor W. Coley

TL;DR

Efficient and Programmable Exploration of Synthesizable Chemical Space introduces PrexSyn, a decoder-only Transformer that generates postfix synthesis representations conditioned on molecular properties, ensuring synthesizability. It leverages a high-throughput C++ data engine to train on billion-scale pathway-property data, achieving near-complete coverage of the Enamine REAL space with much faster inference than prior methods. It supports composite property queries using AND/NOT/OR logic via a product-of-experts sampling scheme and enables query-space optimization against black-box docking oracles, improving sampling efficiency. Empirical results show state-of-the-art reconstruction and similarity on Enamine/ChEMBL, superior GuacaMol performance, and effective docking-based molecular optimization (sEH and Mpro2). Overall, PrexSyn advances synthesizable molecular design by combining high-coverage sampling, fast inference, and programmable objective specification.
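The composite-query idea in the TL;DR can be illustrated with a generic product-of-experts combination of next-token distributions. This is a minimal sketch, not the authors' exact formulation: it assumes each condition yields its own logit vector over the same vocabulary, that AND multiplies the per-condition probabilities, and that NOT divides by them with a weight. The function names and the `neg_weight` parameter are hypothetical, and OR is not covered here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def combine_experts(positive, negative=(), neg_weight=1.0):
    """Combine per-condition next-token logits into one distribution.

    positive: logit vectors for conditions joined by AND (assumed semantics).
    negative: logit vectors for conditions under NOT (assumed semantics).
    neg_weight: hypothetical strength of the NOT penalty.
    """
    vocab_size = len(positive[0])
    score = [0.0] * vocab_size
    # AND: add log-probabilities, i.e. multiply probabilities.
    for logits in positive:
        for i, lp in enumerate(log_softmax(logits)):
            score[i] += lp
    # NOT: subtract log-probabilities, i.e. divide by probabilities.
    for logits in negative:
        for i, lp in enumerate(log_softmax(logits)):
            score[i] -= neg_weight * lp
    # Renormalize so the result is a valid distribution to sample from.
    return [math.exp(lp) for lp in log_softmax(score)]
```

Under these assumptions, a token keeps high probability only if every AND-ed condition supports it, while a token favored by a NOT condition is pushed down.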

Abstract

The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to "program" generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.
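The streaming data engine mentioned in the abstract is described in the Figure 1(a) caption as a producer-consumer design. The sketch below shows that pattern in plain Python threads rather than C++, purely for illustration: the sample payload is a placeholder, and `producer`, `consume_batches`, `run_engine`, and all parameter values are hypothetical names and choices, not part of PrexSyn.

```python
import queue
import threading


def producer(worker_id, n_samples, out_queue):
    """Stand-in for a thread that generates and featurizes samples."""
    for i in range(n_samples):
        sample = {"worker": worker_id, "index": i}  # placeholder payload
        out_queue.put(sample)  # blocks while the queue is full (backpressure)
    out_queue.put(None)  # sentinel: this producer has finished


def consume_batches(out_queue, n_producers, batch_size):
    """Yield fixed-size batches until every producer has sent its sentinel."""
    finished = 0
    batch = []
    while finished < n_producers:
        item = out_queue.get()
        if item is None:
            finished += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def run_engine(n_producers=4, samples_per_producer=25, batch_size=8):
    """Start the producers and collect the batches a trainer would consume."""
    q = queue.Queue(maxsize=32)  # bounded buffer between the two sides
    threads = [
        threading.Thread(target=producer, args=(w, samples_per_producer, q))
        for w in range(n_producers)
    ]
    for t in threads:
        t.start()
    batches = list(consume_batches(q, n_producers, batch_size))
    for t in threads:
        t.join()
    return batches
```

The bounded queue is the key design choice in this sketch: producers stall when the consumer falls behind, so memory use stays fixed while the training loop is never handed stale, precomputed data.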

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (a) The high-throughput C++ data engine generates synthetic pathways and computes molecular properties on the fly. It adopts a producer-consumer architecture, where multiple producer threads generate and featurize samples, which are then consumed by the Python-based training framework. (b) The decoder-only transformer architecture predicts the next token conditioned on property prompts. (c) Multiple properties in a composite query are used to condition the model separately. The resulting distributions are then combined. (d) Query space optimization. At each step, molecular properties are computed and perturbed to recondition the model. New molecules are then evaluated by oracle functions, and those with improved properties replace previous ones. (e) Since the structural properties used for training are sufficiently expressive to locate molecules, we can sample molecules with respect to general properties defined by black-box oracles by iteratively refining structural-property queries.
  • Figure 2: (a) Reconstruction rate/similarity versus inverse inference time on the Enamine and ChEMBL test sets. PrexSyn outperforms all baselines in both accuracy and efficiency. (b) Similarity of molecules projected by PrexSyn versus SynFormer on the ChEMBL test set. Each point corresponds to a molecule; points above the diagonal indicate higher similarity achieved by PrexSyn. The majority of points lie above the line, demonstrating consistently better performance. (c) Reconstruction rate versus training data scale. The reconstruction rate increases with larger training data, highlighting the benefit of large-scale training. (d) Example projection of a molecule from the ChEMBL test set that PrexSyn reconstructs perfectly. Tanimoto similarity using Morgan fingerprints is shown.
  • Figure 3: (a) Optimization trajectory, two snapshots, and six high-scoring samples for the Celecoxib Rediscovery task. PrexSyn successfully reconstructs Celecoxib and discovers high-scoring analogs. (b) Optimization trajectory on the scaffold hopping task with composite property queries. The composite query-conditioned optimization (green curve) achieves higher efficiency than the condition-free baseline (blue curve) and the optimization landscape is smoother. (c) Docking score optimization trajectory on the Mpro2 task. PrexSyn generates molecules that achieve improved docking scores compared to the baseline inhibitor (dashed line). The generated molecules share binding modes similar to the baseline inhibitor, including fitting into the highlighted subpocket.
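The query-space optimization loop described in the Figure 1(d) caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: molecules are stood in for by bit tuples, the "model" simply returns the query it is given, and the oracle counts matches against a hidden target. Every function name here (`optimize_in_query_space`, `toy_featurize`, `toy_perturb`, `toy_sample`, `toy_oracle`) is hypothetical.

```python
import random


def optimize_in_query_space(population, featurize, perturb, sample_from_query,
                            oracle, n_steps, rng):
    """Iteratively refine queries; keep a candidate only if its score improves."""
    scored = [(oracle(mol), mol) for mol in population]
    for _ in range(n_steps):
        for i, (score, mol) in enumerate(scored):
            query = perturb(featurize(mol), rng)      # compute and perturb properties
            candidate = sample_from_query(query, rng)  # recondition the model
            cand_score = oracle(candidate)             # black-box evaluation
            if cand_score > score:                     # improved ones replace old ones
                scored[i] = (cand_score, candidate)
    return scored


# Toy stand-ins (assumptions for illustration only).
TARGET = (1, 0, 1, 1, 0, 1, 0, 0)


def toy_featurize(mol):
    return list(mol)  # the "structural property" is the bit vector itself


def toy_perturb(fingerprint, rng):
    fp = list(fingerprint)
    i = rng.randrange(len(fp))
    fp[i] = 1 - fp[i]  # flip one randomly chosen bit
    return fp


def toy_sample(query, rng):
    return tuple(query)  # a real model would decode a synthesis pathway here


def toy_oracle(mol):
    return sum(a == b for a, b in zip(mol, TARGET))
```

A possible call is `optimize_in_query_space([(0,) * 8, (1,) * 8], toy_featurize, toy_perturb, toy_sample, toy_oracle, n_steps=50, rng=random.Random(0))`. Because a candidate replaces its parent only when the oracle score strictly improves, each population member's score is non-decreasing over steps.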