T-Rex: Text-assisted Retrosynthesis Prediction

Yifeng Liu, Hanwen Xu, Tangqi Fang, Haocheng Xi, Zixuan Liu, Sheng Zhang, Hoifung Poon, Sheng Wang

TL;DR

Retrosynthesis prediction faces vast search spaces when inferring reactants from targets. T-Rex addresses this by integrating textual descriptions generated by large language models with molecular graphs in a two-stage ranking framework: candidate reaction centers are first identified using text-and-graph features, then re-ranked using reactant descriptions to refine predictions. The approach yields substantial gains over state-of-the-art template-free models on USPTO-50k and USPTO-MIT, and demonstrates strong cross-dataset generalization and robustness to rare reaction types. This work demonstrates that leveraging language-model reasoning via text can meaningfully augment computational chemistry tasks, opening avenues for broader text-augmented molecular prediction systems.

Abstract

As a fundamental task in computational chemistry, retrosynthesis prediction aims to identify a set of reactants that can synthesize a target molecule. Existing template-free approaches consider only the graph structure of the target molecule and often generalize poorly to rare reaction types and large molecules. Here, we propose T-Rex, a text-assisted retrosynthesis prediction approach that exploits pre-trained text language models, such as ChatGPT, to assist the generation of reactants. T-Rex first uses ChatGPT to generate a description of the target molecule and ranks candidate reaction centers based on both the description and the molecular graph. It then re-ranks these candidates by querying a description for each reactant and examining which group of reactants can best synthesize the target molecule. We observed that T-Rex substantially outperformed graph-based state-of-the-art approaches on two datasets, indicating the effectiveness of considering text information. We further found that T-Rex outperformed the variant that uses only the ChatGPT-based description without the re-ranking step, demonstrating that our framework outperforms a straightforward integration of ChatGPT and graph information. Collectively, we show that text generated by pre-trained language models can substantially improve retrosynthesis prediction, opening up new avenues for exploiting ChatGPT to advance computational chemistry. The code is available at https://github.com/lauyikfung/T-Rex.
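
The two-stage ranking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` container, the function names, and all scores below are hypothetical, and the real system would obtain its scores from learned models over the ChatGPT descriptions and molecular graphs. The sketch only shows the control flow, in which a first score shortlists candidate reaction centers and a second score reorders the shortlist.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate reaction center (hypothetical container, not from the paper)."""
    bond_index: int
    stage1_score: float  # assumed to come from a text-and-graph model


def select_candidates(scores: Sequence[float], k: int) -> List[Candidate]:
    """Stage 1: keep the k highest-scoring bonds as candidate reaction centers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [Candidate(i, scores[i]) for i in ranked[:k]]


def rerank(candidates: Sequence[Candidate],
           rerank_score: Callable[[Candidate], float]) -> List[Candidate]:
    """Stage 2: reorder the shortlist by a second, description-based score."""
    return sorted(candidates, key=rerank_score, reverse=True)


# Toy stage-1 scores, one per bond of an imaginary target molecule.
stage1_scores = [0.10, 0.55, 0.05, 0.30]
shortlist = select_candidates(stage1_scores, k=2)

# Toy stage-2 scores standing in for a model that reads reactant descriptions.
toy_rerank_scores = {1: 0.40, 3: 0.90}
final = rerank(shortlist, lambda c: toy_rerank_scores[c.bond_index])

print([c.bond_index for c in shortlist])  # [1, 3]
print([c.bond_index for c in final])      # [3, 1]
```

In this toy run the bond ranked second in stage 1 moves to first place after re-ranking, which is the kind of correction the second stage is meant to allow.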

Paper Structure (22 sections, 10 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Illustration of the retrosynthesis prediction. We formulate retrosynthesis prediction as a two-step approach. First, we identify the bond that splits the target molecule into two synthons. This step is formulated as a multi-class classification problem. Second, each synthon is used to generate a reactant. This step is formulated as a graph-to-graph generation problem.
  • Figure 2: Diagram of T-Rex. T-Rex is a two-stage approach. In the first stage, we use ChatGPT to generate a description for the target product. We then integrate this description and the molecular graph to obtain a few candidate reaction centers. In the second stage, we use ChatGPT to obtain a description for each synthon based on each candidate reaction center. The descriptions of two synthons are used together to re-rank the candidate reaction centers.
  • Figure 3: Performance on within-dataset cross-validation. Top-k exact match accuracy on USPTO-50k and filtered USPTO-MIT datasets when reaction class is not given.
  • Figure 4: Comparison on cross-dataset retrosynthesis prediction. Top-1, top-3, and top-5 exact match accuracy in the step experiments for G2Gs, G2Gs+Text, and our T-Rex model, as a function of the percentage of the filtered USPTO-MIT training set added for training.
  • Figure 5: Visualization of the embedding space of three models. G2Gs and G2Gs+Text show the embeddings from the reaction center identification stage. T-Rex shows the embeddings from the re-ranking stage, which are unavailable for the other two methods.
  • ...and 1 more figures
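
The top-k exact match accuracy reported in the figure captions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation script: `top_k_accuracy` is a hypothetical helper, and the reactant strings are assumed to have been canonicalized upstream so that plain string equality corresponds to an exact match.

```python
from typing import Sequence


def top_k_accuracy(predictions: Sequence[Sequence[str]],
                   targets: Sequence[str],
                   k: int) -> float:
    """Fraction of examples whose ground truth appears among the first k predictions.

    Each entry of `predictions` is a ranked list (best first) of reactant strings.
    Strings are compared verbatim, so canonicalization is assumed to happen earlier.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, truth in zip(predictions, targets)
               if truth in ranked[:k])
    return hits / len(targets)


# Toy data: three targets with two ranked predictions each.
predictions = [
    ["CCO.CC(=O)O", "CCO.CC(=O)Cl"],  # ground truth ranked first
    ["Brc1ccccc1", "Ic1ccccc1"],      # ground truth ranked second
    ["CCN", "CCO"],                   # ground truth not predicted
]
targets = ["CCO.CC(=O)O", "Ic1ccccc1", "CCBr"]

top1 = top_k_accuracy(predictions, targets, k=1)  # one of three examples hits
top2 = top_k_accuracy(predictions, targets, k=2)  # two of three examples hit
```

Because the first k predictions always include the first k-1, the metric can only stay the same or increase as k grows.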