QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry
Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li
TL;DR
QCBench addresses the gap in quantitative chemistry evaluation by introducing a 350-item benchmark spanning seven subfields and three difficulty levels to systematically assess LLMs’ numerical reasoning. The framework combines expert-curated and existing benchmark problems, robust data processing, leakage checks, and dual answer verification (strict xVerify and tolerance-based) to provide nuanced diagnostics of computational capabilities. Key findings show a persistent gap between linguistic fluency and computational accuracy, with performance degrading as task difficulty increases and substantial ranking decoupling between QCBench and broader chemistry benchmarks. The work also reveals that strict verification can act as a self-correction mechanism for advanced models, guiding future domain-adaptive tuning and potential multi-modal integrations to improve quantitative chemistry reasoning.
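To make the tolerance-based half of the dual verification concrete, the following is a minimal sketch of how a numeric answer might be accepted when it falls within a relative error band around the reference value. The function name `within_tolerance`, the 1% default for `rel_tol`, and the `abs_tol` floor are illustrative assumptions, not values taken from QCBench; the strict xVerify path is not modeled here.

```python
def within_tolerance(predicted: float, reference: float,
                     rel_tol: float = 0.01, abs_tol: float = 1e-12) -> bool:
    """Return True if `predicted` is close enough to `reference`.

    Hypothetical helper: the 1% relative tolerance and the absolute floor
    are assumed defaults, not thresholds reported by the benchmark.
    """
    # Allowed deviation scales with the reference magnitude; the absolute
    # floor keeps the check meaningful when the reference is zero.
    allowed = max(rel_tol * abs(reference), abs_tol)
    return abs(predicted - reference) <= allowed


if __name__ == "__main__":
    # A computed concentration of 0.1004 mol/L against a reference of 0.1000 mol/L
    print(within_tolerance(0.1004, 0.1000))
    # A 5% deviation falls outside the assumed 1% band
    print(within_tolerance(0.105, 0.100))
```

Measuring the deviation relative to the reference rather than to the larger of the two values keeps the acceptance band fixed per problem, so a model cannot widen it by overshooting.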
Abstract
Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this gap, we propose QCBench, a quantitative-chemistry-oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields: analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of LLMs, the problems are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.
