ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving

Botao Yu, Frazier N. Baker, Ziru Chen, Garrett Herb, Boyu Gou, Daniel Adu-Ampratwum, Xia Ning, Huan Sun

TL;DR

ChemToolAgent (CTA) investigates whether augmenting LLMs with domain-specific tools improves chemistry problem solving. Built on the ReAct framework, CTA uses 29 tools and is evaluated on specialized chemistry tasks (e.g., SMolInstruct) and general chemistry benchmarks (MMLU-Chemistry, SciBench-Chemistry, GPQA-Chemistry). Results show substantial gains on specialized, tool-heavy tasks but no consistent advantage over the base LLMs on general questions: tool augmentation helps on some task types but can impede general reasoning. An error analysis attributes the mixed performance to tool-related failures and cognitive load, and points future work toward better tool design, reasoning verification, and load management through multi-agent or information-verification strategies.

Abstract

To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemToolAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemToolAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.

Paper Structure

This paper contains 32 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Our ChemToolAgent framework. Upon receiving a user task, the agent iterates through a three-step ReAct process (Yao et al., 2023): (1) Thought generation, analyzing the current situation and planning subsequent steps; (2) Action determination, selecting the appropriate tool and its input based on the generated thought; and (3) Observation obtaining, executing a tool in the environment and obtaining the results or feedback. This iterative cycle continues until task completion or conclusion, and the final answer is returned to the user.
  • Figure 2: The error statistics of CTA (GPT) on SMolInstruct (102 errors) and MMLU-Chemistry (64 errors).
  • Figure C.1: Tasks in SMolInstruct (Yu et al., 2024).
  • Figure E.2: The statistics of tool usage by ChemToolAgent (GPT). Each cell value is the fraction of samples in which the corresponding tool is used, and "0" indicates that the tool is never used.
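
The thought, action, observation cycle described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `molar_mass` tool, the `scripted_policy` function (a stand-in for the LLM), the `finish` action name, and the `max_steps` budget are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tool set; the agent's actual 29 tools are not reproduced here.
ATOMIC_MASS = {"H": 1.008, "O": 15.999}


def molar_mass(element_counts: str) -> str:
    """Toy tool: takes input such as 'H:2,O:1' and returns the molar mass."""
    total = 0.0
    for part in element_counts.split(","):
        element, count = part.split(":")
        total += ATOMIC_MASS[element] * int(count)
    return f"{total:.3f}"


TOOLS: Dict[str, Callable[[str], str]] = {"molar_mass": molar_mass}

Step = Tuple[str, str, str]          # (thought, action, action_input)
History = List[Tuple[Step, str]]     # each step paired with its observation


def scripted_policy(task: str, history: History) -> Step:
    """Stand-in for the LLM (assumption): returns the next thought and action."""
    if not history:
        return ("I need the molar mass of water.", "molar_mass", "H:2,O:1")
    last_observation = history[-1][1]
    return ("I have the value I need.", "finish", f"{last_observation} g/mol")


def react_loop(task: str, policy, tools, max_steps: int = 5) -> str:
    """Iterate thought -> action -> observation until the policy finishes."""
    history: History = []
    for _ in range(max_steps):
        # (1) Thought generation and (2) action determination.
        thought, action, action_input = policy(task, history)
        if action == "finish":
            return action_input
        # (3) Observation: run the selected tool, or report an unknown tool.
        tool = tools.get(action)
        observation = tool(action_input) if tool else f"Unknown tool: {action}"
        history.append(((thought, action, action_input), observation))
    return "No answer within the step budget."


answer = react_loop("What is the molar mass of water?", scripted_policy, TOOLS)
print(answer)  # 18.015 g/mol
```

In a real agent the policy would be an LLM call that sees the task and the accumulated history; replacing it with a scripted function keeps the control flow visible and the example runnable without any model or network access.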