Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks

S. K. Rithvik

TL;DR

This paper presents a benchmark for quantum mechanics with automatic verification, a systematic evaluation quantifying tier-based performance hierarchies, an empirical analysis of tool augmentation trade-offs, and a reproducibility characterization.

Abstract

We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92% average, 100% for flagship models), while numerical computation remains most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3× token cost masks dramatic heterogeneity ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (GPT-5 achieves zero variance) while specialized models require multi-run evaluation. This work contributes: (i) a benchmark for quantum mechanics with automatic verification, (ii) systematic evaluation quantifying tier-based performance hierarchies, (iii) empirical analysis of tool augmentation trade-offs, and (iv) reproducibility characterization. All tasks, verifiers, and results are publicly released.
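The assessment counts and tier gaps quoted in the abstract can be cross-checked with simple arithmetic. The sketch below assumes that the 900 baseline assessments decompose as 15 models × 20 tasks × 3 runs, and that the 75 tool-augmented assessments are 15 models × 5 numerical tasks in a single run. The abstract does not spell out this decomposition, so it is one plausible reading rather than the authors' accounting, and all variable names are illustrative.

```python
# Hypothetical decomposition of the assessment counts reported in the abstract.
N_MODELS = 15
N_TASKS = 20
N_RUNS = 3             # "across three runs" in the reproducibility analysis
N_NUMERICAL_TASKS = 5  # assumption: tool augmentation covers five numerical tasks

baseline = N_MODELS * N_TASKS * N_RUNS
tool_augmented = N_MODELS * N_NUMERICAL_TASKS  # assumption: a single run

# Tier accuracies in percent, as rounded in the abstract.
tier_accuracy = {"flagship": 81, "mid-tier": 77, "fast": 67}

# Gap of each lower tier to the flagship tier, in percentage points.
gaps_pp = {
    tier: tier_accuracy["flagship"] - acc
    for tier, acc in tier_accuracy.items()
    if tier != "flagship"
}

print(baseline, tool_augmented)
print(gaps_pp)
```

Under these assumptions the counts come out to 900 and 75, and the gaps to 4pp and 14pp, matching the figures stated above.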

Paper Structure (17 sections, 1 equation, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Comprehensive Accuracy Analysis. (a) Accuracy by model tier shows clear stratification with flagship models (81.3% avg) outperforming mid-tier (77.0%) and fast models (67.0%) by 4.3pp and 14.3pp respectively. (b) Task category difficulty reveals distinct performance patterns across tiers. (c) Top 10 models span all three tiers, with Claude Sonnet 4 and Qwen3-Max tied at 85.0%, followed by Claude Sonnet 4.5 (83.3%). (d) Individual task difficulty ranges from 11.1% (T2: quantum tunneling) to 97.8% (D1: commutator algebra), showing substantial variance across the 20 tasks.
  • Figure 2: Per-Task Performance Heatmap. (a) 15 models × 20 tasks showing accuracy (0 to 100%, red-white-green colormap). Models grouped by tier (black horizontal lines separate fast/mid/flagship), tasks grouped by category (vertical black lines separate D/C/N/T). (b) Mean accuracy per task across all models. Tasks reveal dramatic difficulty variation (11.1% to 97.8% mean accuracy).
  • Figure 3: Resource Efficiency Analysis. (a) Cost efficiency: flagship models cost 33× more per task than fast models on average for 14.3pp accuracy gains. (b) Time efficiency: flagship models require 1.6× longer per task on average. (c) Cost-time trade-off: cost and time show tier stratification with substantial within-tier variation. (d) Token efficiency: accuracy versus token usage reveals models with higher token consumption do not necessarily achieve proportionally higher accuracy, suggesting diminishing returns in reasoning verbosity.
  • Figure 4: Tool Augmentation Effects on Numerical Tasks. (a) Overall accuracy comparison between baseline and tool-augmented approaches on T tasks, showing modest +4.4pp improvement. (b) Per-task accuracy breakdown comparing baseline vs. tool-augmented performance across five numerical tasks (T1 through T5), revealing heterogeneous effects. (c) Average tool calls per task by model, with top 10 models shown (overall mean: 1.8 calls). (d) Accuracy change distribution by task, showing gains (green) and losses (red): T1 +28.9pp, T3/T4 +6.7pp each, T2 -4.4pp, T5 -15.6pp.
  • Figure 5: Reproducibility Across Three Runs. (a) Standard deviation distribution shows most model-task pairs have moderate variance. (b) Tier-specific variance: fast models 7.4pp avg, mid-tier 6.3pp, flagship 5.3pp. (c) Model-specific variance reveals GPT-5 as perfectly consistent (0pp) while Qwen 2.5 Coder exhibits highest variance (16.1pp). (d) Task category variance: Derivations (D) show lowest variance (5.4pp), Numerical tasks (T) show highest (14.6pp).
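
The per-task and per-tier values in the captions for Figures 4 and 5 can be checked against the aggregate numbers quoted elsewhere. The sketch below assumes unweighted means, that is, equal weight per numerical task and per tier. The captions do not state the weighting, so this is an assumption, and the variable names are illustrative.

```python
from statistics import mean

# Per-task accuracy change with tools, in percentage points (Figure 4d).
tool_delta_pp = {"T1": 28.9, "T2": -4.4, "T3": 6.7, "T4": 6.7, "T5": -15.6}

# Per-tier average run-to-run variation, in percentage points (Figure 5b).
tier_variation_pp = {"fast": 7.4, "mid-tier": 6.3, "flagship": 5.3}

# Unweighted means (assumption: equal weight per task and per tier).
overall_tool_delta = mean(tool_delta_pp.values())
overall_variation = mean(tier_variation_pp.values())

print(f"mean tool delta: {overall_tool_delta:+.2f}pp")
print(f"mean variation:  {overall_variation:.2f}pp")
```

The mean tool delta comes out near +4.46pp, which agrees with the reported +4.4pp to within the rounding of the per-task values. The mean of the three tier values is about 6.33pp, in line with the 6.3pp average reported in the abstract.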