Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen

TL;DR

Quantum-Audit presents a comprehensive, multi-format benchmark of quantum computing knowledge for 26 LLMs, comprising 2,700 questions (expert-written, LLM-extracted, and deliberately false-premise/open-ended items) and a multilingual Spanish/French subset. The study reveals strong performance on foundational concepts but substantial weaknesses on advanced topics such as quantum security, and a troubling tendency to accept faulty premises. Agentic and deep-research modes yield meaningful improvements (≈6.7 percentage points on average) without achieving near-perfect accuracy, while human experts outperform most models yet still exhibit variability. These findings underscore both progress and persistent challenges in leveraging LLMs for quantum education and research, and they highlight the need for robust, diverse evaluation frameworks as the field evolves.

Abstract

Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, models' understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
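
As a quick sanity check on the composition described above, the following minimal sketch tallies the four question groups and their share of the benchmark. The counts come from the abstract; the dictionary keys are illustrative labels chosen here, not names used by the authors.

```python
# Question counts as stated in the abstract. The keys are hypothetical
# labels for illustration only, not identifiers from the benchmark itself.
composition = {
    "expert_written": 1000,
    "llm_extracted": 1000,
    "open_ended": 350,
    "false_premise": 350,
}

# Total benchmark size and each group's fraction of it.
total = sum(composition.values())
shares = {name: count / total for name, count in composition.items()}

for name, count in composition.items():
    print(f"{name:16s} {count:5d}  {shares[name]:6.1%}")
print(f"{'total':16s} {total:5d}")
```

The total matches the 2,700 questions reported, with the open-ended and false-premise groups together making up the additional 700.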

Paper Structure (18 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Distribution of the 2,700 benchmark questions by topic. Expert-written questions include multiple choice, open-ended, and false premise questions, while LLM-assisted questions consist solely of multiple choice. Topics are ordered by total question count to highlight relative coverage across different quantum computing domains.
  • Figure 2: Sample quantum security questions that demonstrate the technical specificity required for this topic area. Highlighted portions indicate concepts requiring knowledge of recent attack research where even leading models show reduced accuracy.
  • Figure 3: Examples of false premise questions where models must identify and correct erroneous assumptions embedded in the question formulation. Highlighted portions indicate the false premises that should be rejected rather than accepted.
  • Figure 4: Performance comparison of selected LLMs across different capability tiers on the QA2000 benchmark against human baselines. The visualization includes 9 representative models ranging from top performers to those scoring below novice human levels. Bars are colored by model provider.
  • Figure 5: Bubble chart of Spanish (horizontal) versus French (vertical) accuracy on the QA500 benchmark. Each bubble's area is proportional to the model's parameter count; colors indicate providers. The diagonal dashed line marks equal performance across the two languages. Bubbles below the line signal larger accuracy loss in Spanish.
  • ...and 1 more figure