Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Hao Li; He Cao; Bin Feng; Yanjun Shao; Xiangru Tang; Zhiyuan Yan; Li Yuan; Yonghong Tian; Yu Li

TL;DR

ChemCoTBench addresses the need for structured chemical reasoning assessment by representing molecular transformations as modular operations and evaluating LLMs on Molecular Property Optimization and Reaction Prediction. It provides a data-rich ChemCoTDataset and a taxonomy of reasoning steps, enabling slow-thinking, verifiable problem solving in chemistry. Experiments show that although reasoning-enabled models outperform non-reasoning ones on challenging chemical tasks, open-source models still lag due to limited domain-specific data, and domain-specific CoT data augmentation substantially improves performance. The work offers a practical benchmark and dataset to advance AI-assisted chemical discovery.

Abstract

While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting the step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while remaining amenable to structured evaluation. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
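To make the idea of modular "chemical operations" concrete, the following is a minimal Python sketch of addition, deletion, and substitution applied as a step-by-step trace. It is an illustration under strong assumptions and not the ChemCoTBench implementation: the `Molecule` class, the string labels for cores and functional groups, and the helper names `add_group`, `delete_group`, `substitute_group`, and `apply_steps` are all hypothetical, and a real pipeline would operate on actual molecular structures (for example SMILES strings) with chemical validity checks.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Molecule:
    """Toy molecule (hypothetical): a named core plus sorted substituent labels."""
    core: str
    groups: Tuple[str, ...] = ()


def add_group(mol: Molecule, group: str) -> Molecule:
    """Addition: attach one functional-group label to the core."""
    return Molecule(mol.core, tuple(sorted(mol.groups + (group,))))


def delete_group(mol: Molecule, group: str) -> Molecule:
    """Deletion: remove one occurrence of a functional-group label."""
    if group not in mol.groups:
        raise ValueError(f"{group!r} is not present on {mol.core!r}")
    remaining = list(mol.groups)
    remaining.remove(group)  # removes only the first occurrence
    return Molecule(mol.core, tuple(remaining))


def substitute_group(mol: Molecule, old: str, new: str) -> Molecule:
    """Substitution: expressed as a deletion followed by an addition."""
    return add_group(delete_group(mol, old), new)


# Registry of the three modular operations (names are assumptions).
OPERATIONS = {
    "add": add_group,
    "delete": delete_group,
    "substitute": substitute_group,
}


def apply_steps(
    mol: Molecule, steps: Sequence[Tuple[str, Tuple[str, ...]]]
) -> List[Molecule]:
    """Apply (operation, args) steps in order and return every intermediate state."""
    trace = [mol]
    for name, args in steps:
        mol = OPERATIONS[name](mol, *args)
        trace.append(mol)
    return trace


# Usage: each intermediate state can be inspected, which is what makes
# a chain of operations checkable step by step.
start = Molecule("benzene", ("methyl",))
trace = apply_steps(
    start,
    [
        ("add", ("hydroxyl",)),
        ("substitute", ("methyl", "fluoro")),
        ("delete", ("hydroxyl",)),
    ],
)
for step_index, state in enumerate(trace):
    print(step_index, state)
```

Keeping every intermediate state in the trace, rather than only the final molecule, is the design choice that mirrors the step-by-step workflow described above: each operation can be verified on its own before the next is applied.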
