Order Matters in Retrosynthesis: Structure-aware Generation via Reaction-Center-Guided Discrete Flow Matching

Chenguang Wang; Zihan Zhou; Lei Bai; Tianshu Yu

TL;DR

This work proposes a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias, and achieves state-of-the-art performance on both USPTO-50k and the large-scale USPTO-Full with predicted reaction centers.

Abstract

Template-free retrosynthesis methods treat the task as black-box sequence generation, limiting learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. We address this gap with a key insight: atom ordering in neural representations matters. Building on this insight, we propose a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias. By placing reaction center atoms at the sequence head, our method transforms implicit chemical knowledge into explicit positional patterns that the model can readily capture. The proposed RetroDiT backbone, a graph transformer with rotary position embeddings, exploits this ordering to prioritize chemically critical regions. Combined with discrete flow matching, our approach decouples training from sampling and enables generation in 20 to 50 steps versus 500 for prior diffusion methods. Our method achieves state-of-the-art performance on both USPTO-50k (61.2% top-1) and the large-scale USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4% respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Ablation studies further reveal that structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it.
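To make the idea of placing reaction center atoms at the sequence head concrete, here is a minimal sketch of one way such an ordering could be built. It assumes a breadth-first traversal rooted at a single reaction-center atom; the abstract does not specify the traversal rule, so the BFS choice, the function name `rc_rooted_order`, and the toy adjacency dictionary are all illustrative assumptions rather than the authors' implementation.

```python
from collections import deque


def rc_rooted_order(adjacency, root):
    """Return atom indices in breadth-first order starting from `root`.

    `adjacency` maps each atom index to an iterable of neighbor indices.
    BFS is an assumption made for illustration; the source only states
    that reaction-center atoms are placed at the head of the sequence.
    """
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        atom = queue.popleft()
        order.append(atom)
        # Visit neighbors in index order so the result is deterministic.
        for neighbor in sorted(adjacency[atom]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Atoms unreachable from the root (other fragments) go last, by index.
    order.extend(a for a in sorted(adjacency) if a not in seen)
    return order


# Hypothetical 5-atom chain 0-1-2-3-4 with atom 2 as the reaction center.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(rc_rooted_order(chain, root=2))  # [2, 1, 3, 0, 4]
```

Under this ordering the root atom always receives position 0 and atoms closer to it receive smaller positions, which is the kind of explicit positional pattern a position embedding could pick up on.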

Paper Structure (51 sections, 14 equations, 13 figures, 11 tables, 2 algorithms)

Introduction
Related Work
Background
Method
Structure-Aware Graph Representation
RetroDiT Architecture
Discrete Flow Matching for Retrosynthesis
Modular Inference Pipeline
Experiments
Experimental Setup
Main Results
Analysis: Positional Inductive Bias vs. Scaling
The Bottleneck: Impact of RC Prediction Accuracy
Conclusion and Future Work
Detailed Definition and Extraction of Reaction Centers
...and 36 more sections

Figures (13)

Figure 1: Overview of the structure-aware retrosynthesis framework. Left (Training): Ground-truth reaction centers are identified via atom mapping, generating multiple ordered graph candidates $G_0$ with different RC atoms as roots. Discrete flow matching trains RetroDiT to predict the target reactants $G_1$ from intermediate states $G_t$. Right (Inference): An R-GCN predicts reaction center probabilities, and top-$k$ candidates (4 in the illustration) are used to generate ordered graphs. CTMC sampling progressively transforms the product ($t=0$) into reactants ($t=1$). The training and inference pipelines are decoupled, allowing independent optimization of each component.
Figure 2: Positional Inductive Bias vs. Model Scaling on USPTO-Full. Performance comparison across four model sizes (280K to 65M). RC-Rooted ordering consistently outperforms the Canonical baseline. The Oracle setting with a Small model (280K) matches the Canonical X-Large model (65M), showing that structure-aware priors are more parameter-efficient than scaling alone.
Figure 3: Impact of RC Prediction Accuracy on Generation Performance. The solid lines track the Top-1 Accuracy of our RC-Rooted model as the reaction center accuracy varies from 10% to 100%. The dashed lines represent the performance of the Canonical ordering baseline (which is independent of RC prediction).
Figure 4: Distribution of Reaction Center Types. Percentage of reactions containing each chemical change type. Note the distinct distributional shifts between datasets, particularly in bond formation/breaking patterns.
Figure 5: Cumulative Reaction Coverage. Change types ordered by prevalence. The top-2 types alone cover over 99% of reactions in both datasets.
...and 8 more figures
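The Figure 1 caption mentions intermediate states $G_t$ between the product $G_0$ and the reactants $G_1$. As a rough illustration, the sketch below draws such a state under a simple mixture path, where each position independently takes its target label with probability $t$ and otherwise keeps its source label. This path is a common choice in discrete flow matching but is an assumption here: the page does not state which probability path the paper uses, and the flat label lists, the `"none"` placeholder label, and the function name `sample_intermediate` are hypothetical simplifications of a full node-and-edge graph.

```python
import random


def sample_intermediate(g0, g1, t, rng):
    """Sample a state between `g0` (t=0) and `g1` (t=1).

    Each position independently takes the label from `g1` with
    probability `t` and keeps the label from `g0` otherwise.
    This mixture path is an illustrative assumption.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(g0) != len(g1):
        raise ValueError("g0 and g1 must have the same length")
    return [b if rng.random() < t else a for a, b in zip(g0, g1)]


# Hypothetical categorical labels for four graph positions.
product = ["C", "C", "O", "none"]
reactants = ["C", "C", "O", "Cl"]

rng = random.Random(0)
print(sample_intermediate(product, reactants, 0.0, rng))  # equals product
print(sample_intermediate(product, reactants, 1.0, rng))  # equals reactants
print(sample_intermediate(product, reactants, 0.5, rng))  # a mix of both
```

At training time a model would be asked to predict the target labels from a state drawn this way, while sampling would move from $t=0$ toward $t=1$ in a fixed number of steps.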
