QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry

Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li

TL;DR

QCBench addresses the gap in quantitative chemistry evaluation by introducing a 350-item benchmark spanning seven subfields and three difficulty levels to systematically assess LLMs’ numerical reasoning. The framework combines expert-curated and existing benchmark problems, robust data processing, leakage checks, and dual answer verification (strict xVerify and tolerance-based) to provide nuanced diagnostics of computational capabilities. Key findings show a persistent gap between linguistic fluency and computational accuracy, with performance degrading as task difficulty increases and substantial ranking decoupling between QCBench and broader chemistry benchmarks. The work also reveals that strict verification can act as a self-correction mechanism for advanced models, guiding future domain-adaptive tuning and potential multi-modal integrations to improve quantitative chemistry reasoning.
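As a rough illustration of the tolerance-based half of the dual verification mentioned above, the sketch below accepts a numerical answer when it falls within a relative tolerance of the reference value. The function name `within_tolerance` and the thresholds `rel_tol=0.05` and `abs_tol=1e-9` are assumptions made for this example; the summary does not state the tolerance QCBench uses, and the strict xVerify check is not reproduced here.

```python
def within_tolerance(predicted: float, reference: float,
                     rel_tol: float = 0.05, abs_tol: float = 1e-9) -> bool:
    """Return True if `predicted` is close enough to `reference`.

    rel_tol and abs_tol are hypothetical thresholds chosen for illustration;
    they are not taken from the QCBench paper.
    """
    # Allowed deviation scales with the magnitude of the reference answer;
    # abs_tol keeps the check meaningful when the reference is zero.
    allowed = max(rel_tol * abs(reference), abs_tol)
    return abs(predicted - reference) <= allowed


if __name__ == "__main__":
    # A model answering 8.31 for a reference of 8.314 passes the relaxed check.
    print(within_tolerance(8.31, 8.314))
    # An answer of 9.0 is off by more than 5% and fails.
    print(within_tolerance(9.0, 8.314))
```

A check of this kind forgives rounding differences in intermediate steps, whereas a strict verifier would not, which is one plausible reason for reporting both scores side by side.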

Abstract

Quantitative chemistry is central to modern chemical research, yet the ability of large language models (LLMs) to perform its rigorous, step-by-step calculations remains underexplored. To fill this gap, we propose QCBench, a Quantitative Chemistry-oriented benchmark comprising 350 computational chemistry problems across 7 chemistry subfields: analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry, and quantum chemistry. To systematically evaluate the mathematical reasoning abilities of LLMs, the problems are categorized into three tiers: easy, medium, and difficult. Each problem, rooted in realistic chemical scenarios, is structured to prevent heuristic shortcuts and demand explicit numerical reasoning. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain-adaptive fine-tuning or multi-modal integration. Evaluations on 24 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy. Code for QCBench is available at https://github.com/jiaqingxie/QCBench.

Paper Structure

This paper contains 40 sections, 2 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: The share of quantitative questions is relatively small in some chemistry benchmarks. Performance on reasoning tasks is also relatively high, as observed by ChemBench [mirza2024large], which motivates us to curate a purely computational benchmark. Original figure created by the authors; not previously published.
  • Figure 2: Qualitative chemistry versus predictive chemistry versus quantitative chemistry. We define quantitative chemistry problems as those requiring objective, numerical answers that can be automatically evaluated by tools such as xVerify [chen2025xverify]. In contrast, qualitative problems involve textual explanations or conceptual reasoning without definitive ground-truth answers. Predictive chemistry covers tasks such as simple counting (for example, tallying hydrogen atoms) or interpreting reaction equations, which lack formal computational steps and do not require formula-based reasoning.
  • Figure 3: Framework of QCBench.
  • Figure 4: Approximate-matching and exact-matching (xVerify) accuracies (%, $\uparrow$) of models across 7 chemistry subfields, with columns for Approximate (Appr.) and xVerify results. Bold indicates the best score per column; underlined indicates the second-best (excluding ties). Our experiments employ xVerify-0.5B-I. Results are for the balanced benchmark dataset, with 95% confidence intervals.
  • Figure 5: Approximate-matching and exact-matching (xVerify) accuracies (%, $\uparrow$) of models across 7 chemistry subfields, with columns for Approximate (Appr.) and xVerify results. Bold indicates the best score per column; underlined indicates the second-best (excluding ties). Our experiments employ xVerify-0.5B-I. Results are for the full benchmark dataset, with 95% confidence intervals.
  • ...and 9 more figures
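
The captions for Figures 4 and 5 report accuracies with 95% confidence intervals. The listing above does not say how those intervals were computed, so the sketch below simply assumes a normal-approximation (Wald) interval over per-problem pass/fail outcomes. The function name `accuracy_with_ci` and the sample data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_ci(outcomes: Sequence[bool], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) for a list of pass/fail outcomes.

    Uses the normal-approximation (Wald) interval, which is an assumption
    made for this example and not necessarily the paper's method.
    z = 1.96 corresponds to a 95% confidence level.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one outcome is required")
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp the interval to the valid probability range [0, 1].
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Hypothetical subfield with 50 problems, 30 answered correctly.
    acc, lower, upper = accuracy_with_ci([True] * 30 + [False] * 20)
    print(f"accuracy = {acc:.1%}, 95% CI = [{lower:.1%}, {upper:.1%}]")
```

With only a few dozen problems per subfield, intervals of this kind are wide, which is worth keeping in mind when comparing models whose per-subfield scores differ by a few points.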