Electron flow matching for generative reaction mechanism prediction obeying conservation laws

Joonyoung F. Joung, Mun Hong Fong, Nicholas Casetti, Jordan P. Liles, Ne S. Dassanayake, Connor W. Coley

TL;DR

This work introduces FlowER, a conservation‑aware generative framework that treats chemical reactions as electron redistribution captured by a Bond‑Electron ($BE$) matrix. By applying flow matching to learn a time‑dependent vector field, FlowER enforces exact mass conservation ($\sum \Delta BE =0$) and yields interpretable, mechanistic step predictions that align with textbook chemistry. Empirically, FlowER outperforms sequence‑based baselines in structural validity and pathway coverage, demonstrates strong generalization to unseen reaction classes with data‑efficient fine‑tuning, and provides a natural interface for thermodynamic or kinetic feasibility assessments via downstream quantum calculations. This conservation‑conscious approach bridges predictive accuracy with mechanistic understanding, offering a robust tool for synthesis planning and reaction discovery, while inviting expansion of mechanistic datasets to broaden its coverage.
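To make the conservation constraint concrete, the following is a minimal sketch of a bond-electron matrix for a toy elementary step, the protonation of hydroxide to give water. It assumes Ugi's convention, in which diagonal entries count non-bonding valence electrons and off-diagonal entries hold bond orders; the atom ordering, the example reaction, and the helper names (`total_electrons`, `delta_be`, `is_symmetric`) are illustrative assumptions, not taken from the paper.

```python
# Toy BE matrices for HO- + H+ -> H2O (illustrative example, not from the paper).
# Assumed convention (Ugi): be[i][i] = non-bonding valence electrons on atom i,
# be[i][j] = bond order between atoms i and j (so the matrix is symmetric).
# Atom order is an arbitrary choice: 0 = O, 1 = H already bonded, 2 = incoming H+.

reactant_be = [
    [6, 1, 0],  # O: three lone pairs, one O-H bond
    [1, 0, 0],  # H bonded to O
    [0, 0, 0],  # bare proton: no electrons, no bonds
]
product_be = [
    [4, 1, 1],  # O: two lone pairs, two O-H bonds
    [1, 0, 0],
    [1, 0, 0],
]


def total_electrons(be):
    """Sum of all entries; each single bond appears twice, i.e. two electrons."""
    return sum(sum(row) for row in be)


def delta_be(before, after):
    """Entry-wise change in the BE matrix for one elementary step."""
    return [[a - b for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(before, after)]


def is_symmetric(be):
    """A valid BE matrix has the same bond order at [i][j] and [j][i]."""
    n = len(be)
    return all(be[i][j] == be[j][i] for i in range(n) for j in range(n))


delta = delta_be(reactant_be, product_be)
print("delta BE:", delta)
print("sum of delta BE:", total_electrons(delta))
```

The oxygen lone pair loses two electrons and the new O-H bond gains one at each of its two symmetric positions, so the entries of the change matrix sum to zero while the total valence electron count stays at eight.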

Abstract

Central to our understanding of chemical reactivity is the principle of mass conservation, which is fundamental for ensuring physical consistency, balancing equations, and guiding reaction design. However, data-driven computational models for tasks such as reaction product prediction rarely abide by this most basic constraint. In this work, we recast the problem of reaction prediction as a problem of electron redistribution using the modern deep generative framework of flow matching. Our model, FlowER, overcomes limitations inherent in previous approaches by enforcing exact mass conservation, thereby resolving hallucinatory failure modes, recovering mechanistic reaction sequences for unseen substrate scaffolds, and generalizing effectively to out-of-domain reaction classes with extremely data-efficient fine-tuning. FlowER additionally enables estimation of thermodynamic or kinetic feasibility and manifests a degree of chemical intuition in reaction prediction tasks. This inherently interpretable framework represents a significant step in bridging the gap between predictive accuracy and mechanistic understanding in data-driven reaction outcome prediction.

Paper Structure

This paper contains 13 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: a. Representation of flow in a hypothetical 1D chemical space. Starting from the initial reactant electron configuration on the far left, the reaction progresses probabilistically through intermediate states to produce a distribution over products on the far right. Each molecule occupies a position in a 1D chemical space, representing its state during the reaction. The transition between states represents the flow of the reaction, wherein the electron configuration is gradually redistributed, ultimately transforming the reactant into the product. b. To enforce strict conservation, FlowER formalizes a chemical reaction as the redistribution of valence electrons between atoms present among the reacting species. The state of a system with fixed atomic identities is represented by a bond-electron (BE) matrix as championed by Ugi in the 1970s [10.1007/BFb0051317, ugi1993computer]. The electron redistribution process is learned through flow matching, and electrons are conserved throughout. c. FlowER takes in a "transient" state at any given timepoint as input. The molecular structure is represented by a BE matrix and a matrix of atom features. This information is processed by a series of transformer blocks employing a multi-head attention mechanism to finally predict the changes in lone pair and bond electrons, the sum of which is constrained to be zero.
  • Figure 1: a. Flow matching is a recently developed generative technique [tong2023conditional] that is well suited to learning the process of electron redistribution. This diagram illustrates the flow matching training process for the reaction of an epoxide and an amine, whose electron distribution is shown in Fig. \ref{fig1}b. It highlights the changes in electron count on lone pairs and bonds, steered by a vector field (black arrow), from $t=0$ (reactant) to $t=1$ (product). b. Provided a set of reacting species represented by their bond-electron matrix, a trained model predicts the redistribution of electrons in an iterative manner that ultimately leads to the prediction of one mechanistic step. This can be repeated to provide a full multi-step sequence resulting in a stable product.
  • Figure 2: a. Validity and conservation performance of FlowER compared to Graph2SMILES without (G2S) or with (G2S+H) Kekulé structures and explicit hydrogens. FlowER achieves higher rates of structure validity, heavy atom conservation, proton conservation, and electron conservation between reactants and products. b. Top-k step accuracy for predicting the product of a single elementary step. c. Top-k pathway accuracy for predicting the full reaction pathway during beam search, showing the minimum beam width (k) required to capture the correct sequence of elementary steps. d. FlowER predictions and corresponding numbers of elementary steps for an example amide condensation under different reaction conditions: without a catalyst, using only DCC (N,N'-Dicyclohexylcarbodiimide), using DCC with HOBt (1-Hydroxybenzotriazole), using CDI (Carbonyldiimidazole), using BOP (Benzotriazol-1-yloxytris(dimethylamino)phosphonium hexafluorophosphate), and using HATU (O-(7-Azabenzotriazol-1-yl)-N,N,N',N'-tetramethyluronium hexafluorophosphate). FlowER successfully predicts the major products and byproducts for all six cases; some species are left activated and would be subsequently neutralized.
  • Figure 2: a. An experimental reaction example reported in the patent literature in 2024 [US20240150295, US20240150296], which corresponds to a reaction type not seen during training. b. G2S predicts top-1 and top-2 mechanistic sequences that either remain unreactive or terminate prematurely. Although G2S manages to predict one of the major products through a lower-ranked pathway, it still exhibits "alchemy" that violates mass conservation (violations are highlighted in red). The numbers above each arrow represent the rank of that reaction as proposed by G2S.
  • Figure 3: a. Analysis of FlowER predictions as a function of choice of base/nucleophile. The reaction can proceed via two distinct pathways: the keto alpha-alkylation pathway and the S$_\text{N}$2 pathway. The choice between these pathways depends on whether the base functions primarily as a base or as a nucleophile. The blue box shows whether FlowER proposes the final product for each pathway when using various bases, annotated with their conjugate acid $\text{p}K_{\mathrm{a}}$. If both pathways are marked with an 'X', FlowER predicted no reaction. b. An example reaction reported in 2024 [US20240150295, US20240150296] (Extended Fig. \ref{G2S_prediction}a), where FlowER successfully reproduces the two experimentally recorded products in ten sequential steps. The numbers above each arrow represent the number of times that reaction was proposed during 32 independent sampling steps. The complete reaction pathway predicted by FlowER is provided in Extended Fig. \ref{FlowER_prediction}. c. $G_{\mathrm{rel}}$ values of each state, representing the Gibbs free energies, calculated using the B3LYP/6-311G level of theory with water as the solvent, modeled via the SMD method. The reaction coordinate spans from the initial reactant state at $x = 1$ to the final product state at $x = 11$, with intermediate steps corresponding to mechanistic transformations predicted by FlowER. The green pathway corresponds to the transformation leading to Product 1, while the blue pathway represents the route toward Product 2. All energy values are referenced relative to the reactant state set at 0 kcal/mol.
  • ...and 3 more figures
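
The captions above describe electrons being steered by a vector field from $t=0$ (reactant) to $t=1$ (product). As a rough illustration only, the sketch below uses the simplest flow matching path, a straight-line interpolation between two BE matrices. This is an assumption, since the excerpt does not specify the paper's actual probability path, noise model, or network. The toy matrices (hydroxide plus a proton giving water) and the function names `interpolate`, `target_velocity`, and `euler_integrate` are hypothetical, and an oracle velocity stands in for a trained model.

```python
# Straight-line flow between two BE matrices (assumed path; toy example).
# Convention assumed: diagonal = non-bonding electrons, off-diagonal = bond order.
x0 = [[6.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # t = 0: HO- + H+
x1 = [[4.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # t = 1: H2O


def interpolate(a, b, t):
    """Point on the straight-line path: x_t = (1 - t) * a + t * b."""
    return [[(1.0 - t) * ai + t * bi for ai, bi in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def target_velocity(a, b):
    """Regression target for a straight-line path: d x_t / dt = b - a."""
    return [[bi - ai for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]


def matrix_sum(m):
    """Total of all entries, i.e. the valence electron count for a BE matrix."""
    return sum(sum(row) for row in m)


def euler_integrate(start, velocity_fn, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = [row[:] for row in start]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [[xi + dt * vi for xi, vi in zip(rx, rv)] for rx, rv in zip(x, v)]
    return x


v = target_velocity(x0, x1)
midpoint = interpolate(x0, x1, 0.5)
# Oracle velocity used in place of a learned network, for illustration only.
final = euler_integrate(x0, lambda x, t: v, n_steps=4)

print("velocity sum:", matrix_sum(v))
print("electrons at t=0.5:", matrix_sum(midpoint))
print("final state:", final)
```

Because the target velocity sums to zero, every intermediate "transient" state along this path carries the same total electron count as the reactant, which is the property the zero-sum constraint on the predicted changes is meant to preserve.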