Automated Retrosynthesis Planning of Macromolecules Using Large Language Models and Knowledge Graphs
Qinyu Ma, Yuhao Zhou, Jianfeng Li
TL;DR
The paper addresses the challenge of identifying reliable synthesis pathways for macromolecules, where non-unique nomenclature and data gaps hinder retrosynthesis. It introduces an end-to-end agent that combines large language models (LLMs) with knowledge graphs (KGs) to automate literature retrieval, reaction data extraction, database querying, and retrosynthetic tree construction, augmented by a Multi-branched Reaction Pathway Search (MBRPS) and a memoized depth-first search (MDFS). A Chain-of-Thought (CoT) framework guides the evaluation of candidate pathways, ranking them by practical criteria such as availability, cost, mildness, yield, scalability, and safety. Demonstrated on polyimide, the approach builds a retrosynthetic tree with hundreds of pathways and recommends routes that include both known and novel ones, illustrating its potential for broader macromolecular synthesis planning and faster materials discovery. Code is available for reproducibility.
Abstract
Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, where the nomenclature of macromolecules is intricate and often non-unique. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs. By leveraging LLMs' strong capabilities for extracting and recognizing chemical substance names and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literature, the extraction of reaction data, database querying, the construction of retrosynthetic pathway trees, their further expansion through the retrieval of additional literature, and the recommendation of optimal reaction pathways. To account for the complex interdependencies among chemical reactants, we propose a novel Multi-branched Reaction Pathway Search Algorithm (MBRPS) that identifies all valid multi-branched reaction pathways, which arise when a single product decomposes into multiple reaction intermediates. In contrast, previous studies were limited to cases where a product decomposes into at most one reaction intermediate. This work represents the first attempt to develop a fully automated, LLM-powered retrosynthesis planning agent tailored specifically to macromolecules. Applied to polyimide synthesis, our approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways. This demonstrates that using LLMs to consult the literature for specific tasks is feasible and, given the vast amount of materials-related literature, crucial for future materials research.
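The idea of a multi-branched pathway can be illustrated with a small sketch. The abstract does not specify how MBRPS or the memoized DFS are implemented, so the code below is only an assumption-laden illustration of the general technique: a memoized depth-first search over a toy reaction graph in which one product may require several intermediates, each needing its own sub-pathway. The function name `enumerate_pathways`, the data layout, and all substance names other than "polyimide" are hypothetical placeholders, not data or identifiers from the paper, and the sketch assumes the reaction graph is acyclic.

```python
from itertools import product


def enumerate_pathways(target, reactions, purchasable):
    """Enumerate every pathway that synthesizes `target`.

    reactions:   dict mapping a product to a list of reactant tuples
                 (one tuple per known reaction). Hypothetical layout.
    purchasable: set of substances that need no further synthesis.
    A pathway is a frozenset of (product, reactants) steps.
    Assumes the reaction graph is acyclic; a cycle raises ValueError.
    """
    memo = {}            # substance -> list of pathways (the memoization)
    in_progress = set()  # substances on the current DFS stack

    def visit(substance):
        if substance in purchasable:
            return [frozenset()]  # nothing left to synthesize
        if substance in memo:
            return memo[substance]
        if substance in in_progress:
            raise ValueError(f"cycle detected at {substance!r}")
        in_progress.add(substance)

        pathways = []
        for reactants in reactions.get(substance, []):
            # Multi-branched case: every reactant needs its own sub-pathway,
            # so take one choice per branch and combine them.
            branch_options = [visit(r) for r in reactants]
            step = (substance, tuple(reactants))
            for combo in product(*branch_options):
                pathways.append(frozenset({step}).union(*combo))

        in_progress.discard(substance)
        # Drop duplicate pathways while keeping first-seen order.
        memo[substance] = list(dict.fromkeys(pathways))
        return memo[substance]

    return visit(target)


# Toy data (placeholders, not taken from the paper).
reactions = {
    "polyimide": [("monomer_A", "monomer_B")],
    "monomer_A": [("feedstock_1",), ("feedstock_2", "feedstock_3")],
    "monomer_B": [("feedstock_4",)],
}
purchasable = {"feedstock_1", "feedstock_2", "feedstock_3", "feedstock_4"}

routes = enumerate_pathways("polyimide", reactions, purchasable)
for route in routes:
    print(sorted(route))
```

In this toy graph the target needs two intermediates at once, so a valid pathway must resolve both branches. A reaction with any reactant that cannot be traced back to purchasable substances contributes no pathways, because the Cartesian product over its branches is empty.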
