Order Matters in Retrosynthesis: Structure-aware Generation via Reaction-Center-Guided Discrete Flow Matching

Chenguang Wang, Zihan Zhou, Lei Bai, Tianshu Yu

TL;DR

This work proposes a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias, and achieves state-of-the-art performance on both USPTO-50k and the large-scale USPTO-Full with predicted reaction centers.

Abstract

Template-free retrosynthesis methods treat the task as black-box sequence generation, limiting learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. We address this gap with a key insight: atom ordering in neural representations matters. Building on this insight, we propose a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias. By placing reaction center atoms at the sequence head, our method transforms implicit chemical knowledge into explicit positional patterns that the model can readily capture. The proposed RetroDiT backbone, a graph transformer with rotary position embeddings, exploits this ordering to prioritize chemically critical regions. Combined with discrete flow matching, our approach decouples training from sampling and enables generation in 20 to 50 steps versus 500 for prior diffusion methods. Our method achieves state-of-the-art performance on both USPTO-50k (61.2% top-1) and the large-scale USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4%, respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Ablation studies further reveal that structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it.
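To make "placing reaction center atoms at the sequence head" concrete, the following is a minimal sketch of one way such an ordering could be produced. It assumes a breadth-first traversal rooted at a chosen reaction-center atom; the abstract does not specify the traversal, and the function names `rc_rooted_order` and `relabel`, as well as the toy molecule graph, are hypothetical illustrations rather than the authors' implementation.

```python
from collections import deque


def rc_rooted_order(adjacency, root):
    """Return atom indices in BFS order starting from `root` (assumed traversal)."""
    order, seen = [root], {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adjacency[u]):  # sorted for a deterministic order
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    # Atoms unreachable from the root (disconnected fragments) go last.
    order.extend(v for v in sorted(adjacency) if v not in seen)
    return order


def relabel(adjacency, order):
    """Renumber atoms so that position in `order` becomes the new index."""
    new_index = {old: new for new, old in enumerate(order)}
    return {
        new_index[u]: sorted(new_index[v] for v in nbrs)
        for u, nbrs in adjacency.items()
    }


# Toy graph (hypothetical): bonds 0-1, 1-2, 2-3, 2-4; atom 2 is the reaction center.
toy_adjacency = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
toy_order = rc_rooted_order(toy_adjacency, root=2)
toy_relabelled = relabel(toy_adjacency, toy_order)
print(toy_order)          # [2, 1, 3, 4, 0]
print(toy_relabelled[0])  # [1, 2, 3]
```

After relabelling, the reaction-center atom sits at index 0 and its neighbours occupy the next positions, so a position-aware model sees the chemically critical region at a fixed location in every example.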

Paper Structure

This paper contains 51 sections, 14 equations, 13 figures, 11 tables, 2 algorithms.

Figures (13)

  • Figure 1: Overview of the structure-aware retrosynthesis framework. Left (Training): Ground-truth reaction centers are identified via atom mapping, generating multiple ordered graph candidates $G_0$ with different RC atoms as roots. Discrete flow matching trains RetroDiT to predict the target reactants $G_1$ from intermediate states $G_t$. Right (Inference): An R-GCN predicts reaction center probabilities, and top-$k$ candidates (4 in the illustration) are used to generate ordered graphs. CTMC sampling progressively transforms the product ($t=0$) into reactants ($t=1$). The training and inference pipelines are decoupled, allowing independent optimization of each component.
  • Figure 2: Positional Inductive Bias vs. Model Scaling on USPTO-Full. Performance comparison across four model sizes (280K to 65M). RC-Rooted ordering consistently outperforms the Canonical baseline. The Oracle setting with a Small model (280K) matches the Canonical X-Large model (65M), showing that structure-aware priors are more parameter-efficient than scaling alone.
  • Figure 3: Impact of RC Prediction Accuracy on Generation Performance. The solid lines track the Top-1 Accuracy of our RC-Rooted model as the reaction center accuracy varies from 10% to 100%. The dashed lines represent the performance of the Canonical ordering baseline (which is independent of RC prediction).
  • Figure 4: Distribution of Reaction Center Types. Percentage of reactions containing each chemical change type. Note the distinct distributional shifts between datasets, particularly in bond formation/breaking patterns.
  • Figure 5: Cumulative Reaction Coverage. Change types ordered by prevalence. The top-2 types alone cover over 99% of reactions in both datasets.
  • ...and 8 more figures
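
The CTMC sampling described in the Figure 1 caption, which moves a product state at $t=0$ toward reactants at $t=1$, can be illustrated with a small sketch. This is not the paper's sampler: it assumes a linear mixture probability path, under which each position jumps to a sampled prediction of its final state with probability $\Delta t / (1 - t)$ per Euler step, and it handles only per-atom categorical labels (no bonds). The names `sample_ctmc` and `oracle_denoiser`, and the toy label arrays, are hypothetical.

```python
import numpy as np


def sample_ctmc(x0, denoiser, num_steps=20, seed=0):
    """Euler-style sampling of a discrete flow from x0 (t=0) toward t=1.

    `denoiser(x, t)` must return an array of shape (num_atoms, num_classes)
    holding predicted probabilities of each atom's final label.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, copy=True)
    for i in range(num_steps):
        t = i / num_steps
        probs = denoiser(x, t)
        # Draw one predicted final label per atom by inverse-CDF sampling.
        cdf = probs.cumsum(axis=1)
        cdf[:, -1] = 1.0  # guard against floating-point round-off
        u = rng.random((len(x), 1))
        x1_guess = (u < cdf).argmax(axis=1)
        # Assumed linear path: dt / (1 - t) simplifies to 1 / (num_steps - i),
        # which is exactly 1 on the last step.
        jump_prob = 1.0 / (num_steps - i)
        jump = rng.random(len(x)) < jump_prob
        x = np.where(jump, x1_guess, x)
    return x


# Toy example: 5 atoms, 4 label classes, and a stand-in "oracle" denoiser
# that always predicts the target labels with probability 1.
num_classes = 4
product_labels = np.array([0, 0, 1, 2, 3])
target_labels = np.array([0, 3, 1, 1, 3])


def oracle_denoiser(x, t):
    return np.eye(num_classes)[target_labels]


sampled = sample_ctmc(product_labels, oracle_denoiser, num_steps=20)
print(sampled)  # [0 3 1 1 3]
```

Because the denoiser is called once per step and the step count is a sampling-time argument, a sketch like this can be run with 20 or 50 steps without retraining, which is the sense in which training and sampling are decoupled in this toy setting.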