ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Xiangru Tang; Tianyu Hu; Muyang Ye; Yanjun Shao; Xunjian Yin; Siru Ouyang; Wangchunshu Zhou; Pan Lu; Zhuosheng Zhang; Yilun Zhao; Arman Cohan; Mark Gerstein

TL;DR

ChemAgent tackles the difficulty of chemical reasoning in LLMs by introducing a self-updating memory library organized into Planning, Execution, and Knowledge memories that stores decomposed sub-tasks and their solutions. The framework supports memory-based retrieval, refinement, and runtime evolution, enabling progressive improvement as the model encounters more problems. Across SciBench datasets and multiple backbones (GPT-3.5, GPT-4, Llama3, Qwen), ChemAgent yields substantial accuracy gains, particularly with stronger models, and demonstrates the value of memory quality and structured sub-task decomposition. The work shows strong potential for scaling to drug discovery and materials science, with open-source code to encourage broader adoption and extension.

Abstract

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent
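The store-then-retrieve cycle described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `MemoryEntry`, `Library`, `retrieve`, and the `threshold` parameter are hypothetical, and a bag-of-words cosine similarity stands in for whatever similarity measure ChemAgent actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    goal: str      # description of a solved sub-task
    solution: str  # the stored plan or worked solution for that sub-task


def _bag_of_words(text: str) -> Counter:
    # Lowercased alphanumeric tokens with their counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Library:
    entries: list = field(default_factory=list)

    def add(self, goal: str, solution: str) -> None:
        # Self-updating step: every solved sub-task becomes a new memory.
        self.entries.append(MemoryEntry(goal, solution))

    def retrieve(self, query: str, k: int = 1, threshold: float = 0.1) -> list:
        # Return up to k stored entries whose goal is most similar to the query.
        q = _bag_of_words(query)
        scored = [(_cosine(q, _bag_of_words(e.goal)), e) for e in self.entries]
        scored = [pair for pair in scored if pair[0] >= threshold]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:k]]


library = Library()
library.add("Calculate the energy of a photon from its wavelength", "E = h * c / wavelength")
library.add("Convert a mass in grams to moles", "n = m / M")

hits = library.retrieve("Find the photon energy for a wavelength of 500 nm")
misses = library.retrieve("zzz")

# After a new sub-task is solved, it is written back for future queries.
library.add("Convert an energy from eV to joules", "E_J = E_eV * 1.602176634e-19")
```

In this sketch the library grows with each solved sub-task, which is the property the abstract refers to as improving over time through experience. A real system would likely separate plan, execution, and knowledge memories and use learned embeddings rather than token overlap.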

Paper Structure (33 sections, 5 equations, 20 figures, 6 tables, 1 algorithm)

Introduction
Method
Preliminaries
Composition of the Library
Decomposition as Atomic Blocks
Library Construction
Library-Enhanced Reasoning
Evaluate & Refine Module
Setup
Results
Self-evolution during runtime
Cost Analysis
Error Analysis
Ablation Study
Memory Component Analysis
...and 18 more sections

Figures (20)

Figure 1: Comparison of problem-solving approaches for a hydrogen atom energy transition problem. The figure illustrates three methods: (a) a standard Chain-of-Thought approach from SciBench (Wang et al., 2023), which makes calculation errors in steps 3 and 4; (b) the StructChem method (Ouyang et al., 2024), which uses formula generation and step-by-step reasoning but fails because of an incorrect constant and an incorrect unit conversion (steps 1 and 4); and (c) the ChemAgent solution, which combines task decomposition, memory retrieval from the library, and reasoning to reach the correct final answer.
Figure 2: Overview of our framework. It contains (a) library-enhanced reasoning and (b) library construction. Panel (a) illustrates how ChemAgent uses the library to address a new task from the test set, and panel (b) shows the construction of the library over the dev set, including Plan Memory ($\mathcal{M}_p$) and Execution Memory ($\mathcal{M}_e$).
Figure 3: Given a task $\mathcal{P}$, the relevant memory examples are provided in the library. Specifically, while Execution Memory ($\mathcal{M}_e$) and Plan Memory ($\mathcal{M}_p$) are derived from prior experiences, Knowledge Memory ($\mathcal{M}_k$) is generated by the LLM based on the problem prompt. The conditions $\mathcal{C}$ are not explicitly presented here but are embedded within $\mathcal{P}$ and the [GOAL] of $\mathcal{M}_e$.
Figure 4: Overall framework of the evaluation & refinement module. ChemAgent continuously modifies the solution or the comprehensive strategy until it either reaches the maximum number of trials or meets the evaluator's criteria.
Figure 5: Self-evolving analysis. We test ChemAgent twice for each iteration, and the difference between the two results serves as the error margin. All experiments here are done on the MATTER dataset.
...and 15 more figures
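
The stopping rule in the Figure 4 caption (revise until the evaluator accepts or the trial budget runs out) can be sketched as a short loop. This is an illustrative sketch under assumptions, not the paper's module: `evaluate_and_refine`, `max_trials`, and the toy evaluator and refiner below are hypothetical stand-ins for the LLM-based components.

```python
from typing import Callable, Dict, Tuple

Solution = Dict[str, object]

EV_TO_JOULE = 1.602176634e-19  # exact SI value of one electronvolt in joules


def evaluate_and_refine(
    solution: Solution,
    evaluate: Callable[[Solution], bool],
    refine: Callable[[Solution], Solution],
    max_trials: int = 3,
) -> Tuple[Solution, int, bool]:
    # Returns (final solution, number of evaluations used, accepted flag).
    for trial in range(1, max_trials + 1):
        if evaluate(solution):
            return solution, trial, True
        if trial < max_trials:
            solution = refine(solution)
    return solution, max_trials, False


# Toy stand-ins: the evaluator only accepts answers reported in joules,
# and the refiner repairs the unit by converting from electronvolts.
def is_in_joules(sol: Solution) -> bool:
    return sol["unit"] == "J"


def convert_ev_to_joules(sol: Solution) -> Solution:
    return {"value": sol["value"] * EV_TO_JOULE, "unit": "J"}


final, trials, accepted = evaluate_and_refine(
    {"value": 10.2, "unit": "eV"}, is_in_joules, convert_ev_to_joules
)

# A refiner that never fixes the problem exhausts the trial budget instead.
stuck, stuck_trials, stuck_accepted = evaluate_and_refine(
    {"value": 10.2, "unit": "eV"}, is_in_joules, lambda sol: sol, max_trials=2
)
```

Capping the number of trials keeps the cost bounded when the evaluator can never be satisfied, at the price of occasionally returning an unaccepted solution, which the caller can detect through the returned flag.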
