QuantumBench: A Benchmark for Quantum Problem Solving

Shunya Minami, Tatsuya Ishigaki, Ikko Hamamura, Taku Mikuriya, Youmi Ma, Naoaki Okazaki, Hiroya Takamura, Yohichi Suzuki, Tadashi Kadowaki

TL;DR

QuantumBench targets the gap in evaluating LLMs for quantum science by compiling about 800 undergraduate-level, eight-option multiple-choice questions (MCQs) from public materials across nine subfields. It enables systematic cross-model comparisons and analyses of sensitivity to question format, providing insights into how model size and reasoning strength affect performance and cost. The results show that small- to medium-scale models with moderate reasoning can approach frontier models, while gains from deeper reasoning plateau at higher costs. This benchmark lays the groundwork for more domain-aware evaluation in quantum research and highlights directions for constructing open-ended and procedure-focused assessments in the future.

Abstract

Large language models are now integrated into many scientific workflows, accelerating data analysis, hypothesis generation, and design space exploration. Alongside this growth, there is an increasing need to carefully evaluate whether models accurately capture domain-specific knowledge and notation, since general-purpose benchmarks rarely reflect these requirements. This gap is especially clear in quantum science, which features non-intuitive phenomena and requires advanced mathematics. In this study, we introduce QuantumBench, a benchmark for the quantum domain that systematically examines how well LLMs understand, and can be applied to, this non-intuitive field. Using publicly available materials, we compiled approximately 800 questions with their answers spanning nine areas related to quantum science and organized them into an eight-option multiple-choice dataset. With this benchmark, we evaluate several existing LLMs and analyze their performance in the quantum domain, including sensitivity to changes in question format. QuantumBench is the first LLM evaluation dataset built for the quantum domain, and it is intended to guide the effective use of LLMs in quantum research.
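
As a rough illustration of the eight-option format described above, the following Python sketch builds a lettered prompt from one correct answer and seven distractors, then scores predictions against the gold labels. It is a minimal sketch and not the authors' evaluation code: the `Item` class, `format_item`, and `accuracy` are hypothetical names, and shuffling the option order is an assumption about how such items might be presented.

```python
import random
from dataclasses import dataclass

NUM_OPTIONS = 8          # eight-option format described in the abstract
LABELS = "ABCDEFGH"      # assumed lettering; the paper's labels may differ


@dataclass
class Item:
    """Hypothetical container for one multiple-choice question."""
    question: str
    correct: str
    distractors: list    # seven incorrect answers, assumed distinct from `correct`


def format_item(item, rng):
    """Shuffle the options and return (prompt_text, correct_label)."""
    options = [item.correct] + list(item.distractors)
    if len(options) != NUM_OPTIONS:
        raise ValueError(f"expected {NUM_OPTIONS} options, got {len(options)}")
    rng.shuffle(options)
    lines = [item.question]
    lines += [f"{label}. {opt}" for label, opt in zip(LABELS, options)]
    return "\n".join(lines), LABELS[options.index(item.correct)]


def accuracy(predicted, gold):
    """Fraction of predicted labels that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Uniform random guessing over eight options gives this expected accuracy.
CHANCE_ACCURACY = 1 / NUM_OPTIONS
```

With eight options, uniform guessing yields an expected accuracy of 1/8 = 12.5%, compared with 25% for a four-option format, so reported accuracies should be read against this lower baseline.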

Paper Structure

This paper contains 17 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Visualization of QuantumBench. The questions were embedded into 3072-dimensional vectors using text-embedding-3-large, and mapped onto a two-dimensional space using t-SNE.
  • Figure 2: Distribution of difficulty and expertise levels by category. (a) Difficulty level, ranging from 1 (trivial) to 5 (challenging). (b) Expertise level, ranging from 1 (high school) to 4 (PhD). The graph shows the distribution of problems that were evaluated by at least one annotator.
  • Figure 3: Benchmark results. (a) Accuracy on all 769 problems. Blue indicates open-weight models, and green indicates closed models available only via API. Dark colors denote reasoning models, while light colors denote non-reasoning models. Tags appended to the names of reasoning models indicate the strength of reasoning. (b) Relationship between the number of model parameters and accuracy. Only open-weight models with publicly available parameter counts are shown. Blue circles represent reasoning models, and light-blue squares represent non-reasoning models. (c) Transition of accuracy when varying the reasoning strength for reasoning models. The “minimal” setting is available only for some models. (d) Relationship between average API usage cost per problem and accuracy. Only closed models, which require API usage, are shown. Green circles represent reasoning models, and light-green circles represent non-reasoning models.
  • Figure 4: Category-wise accuracy of the LLMs. (a) Accuracy by question domain. The numbers attached to the axis labels indicate the number of questions in each domain. (b) Accuracy by question type.
  • Figure 5: Accuracy of LLMs across (a) difficulty and (b) expertise levels. Each gray line shows the accuracy of an individual model, and the blue line shows the average accuracy across all models.
  • ...and 3 more figures
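
The per-level breakdown described for Figure 5 (one line per model, plus an average across models) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' analysis code: the record layout `(model, level, is_correct)` and the function names are hypothetical, and the average is assumed to be an unweighted mean over models.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute per-model accuracy at each difficulty or expertise level.

    `records` is an iterable of (model, level, is_correct) tuples
    (a hypothetical layout). Returns {model: {level: accuracy}}.
    """
    # counts[model][level] holds [number correct, number attempted]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, level, is_correct in records:
        cell = counts[model][level]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {
        model: {lvl: c / n for lvl, (c, n) in sorted(levels.items())}
        for model, levels in counts.items()
    }


def mean_across_models(per_model):
    """Unweighted mean accuracy over models at each level."""
    levels = sorted({lvl for accs in per_model.values() for lvl in accs})
    averaged = {}
    for lvl in levels:
        values = [accs[lvl] for accs in per_model.values() if lvl in accs]
        averaged[lvl] = sum(values) / len(values)
    return averaged
```

Here the output of `accuracy_by_level` would correspond to the individual gray lines and the output of `mean_across_models` to the blue line, under the assumptions above.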