BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Bao-Liang Lu, Yang Yang, Hai Zhao

TL;DR

BatGPT-Chem addresses the challenge of robust retrosynthesis planning by integrating chemical knowledge and reaction conditions into a large bilingual LLM. The model uses a unified framework that treats natural language and SMILES as interconvertible, trained with instruction-tuned data from extensive public and private datasets, and explicitly prompts for reaction conditions. It delivers strong zero-shot performance, high reactant accuracy (MaxFrag), diversity of viable routes, and near-100% output validity, outperforming existing AI methods on multiple benchmarks. This work advances AI-driven synthetic design and provides a practical online platform to aid chemists in planning novel syntheses.

Abstract

Retrosynthesis analysis is pivotal yet challenging in drug discovery and organic chemistry. Despite the proliferation of computational tools over the past decade, AI-based systems often fall short in generalizing across diverse reaction types and exploring alternative synthetic pathways. This paper presents BatGPT-Chem, a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Integrating chemical tasks via a unified framework of natural language and SMILES notation, this approach synthesizes extensive instructional data from an expansive chemical database. Employing both autoregressive and bidirectional training techniques across over one hundred million instances, BatGPT-Chem captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions and exhibiting strong zero-shot capabilities. Superior to existing AI methods, our model demonstrates significant advancements in generating effective strategies for complex molecules, as validated by stringent benchmark tests. BatGPT-Chem not only boosts the efficiency and creativity of retrosynthetic analysis but also establishes a new standard for computational tools in synthetic design. This development empowers chemists to adeptly address the synthesis of novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science. We release our trial platform at https://www.batgpt.net/dapp/chem.

Paper Structure (25 sections, 13 figures, 3 tables)

This paper contains 25 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: The annotated reaction graphs. The different fingerprints of reactions are visualized using the TMAP algorithm [tmap] and the Faerun visualization library [faerun].
  • Figure 2: Top-10 MaxFrag accuracy of predictions on different datasets.
  • Figure 3: Top-10 Intersection accuracy of predictions on different datasets.
  • Figure 4: Comparison of predictions between BatGPT-Chem and ChemDFM where products are displayed in pink blocks, reactants are in green blocks, and reaction conditions are in yellow blocks. $\mathbf{a}$ An example from the ELN BH dataset. $\mathbf{b}$ An example from the Denmark dataset.
  • Figure 5: Analysis of predictions generated by BatGPT-Chem. $\mathbf{a}$ Products sampled from the SM, HTE BH, AHO, and BioChem datasets, respectively. $\mathbf{b}$ Number of predictions within Top-$k$. $\mathbf{c}$ Details of predictions, where green means the ground truth is covered and red means it is not.
  • ...and 8 more figures