CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim

TL;DR

CMT-Benchmark addresses the gap in evaluating expert-level AI capabilities for real scientific tasks by introducing a domain-specific, expert-authored benchmark for condensed matter theory. It comprises 50 original problems across seven analytic and computational methods, with deterministic, automatic grading that handles non-commuting operator algebra. Core contributions include high-value, expert-curated data, robust automated parsing, and detailed performance analyses across frontier models, revealing persistent gaps in physical reasoning. The benchmark aims to guide the development of AI research assistants and tutors capable of rigorous, reproducible scientific reasoning in condensed matter physics.

Abstract

Large language models (LLMs) have shown remarkable progress in coding and math problem-solving, but evaluation on advanced research-level problems in the hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 problems covering condensed matter theory (CMT) at the level of an expert researcher. Topics span analytical and computational approaches in quantum many-body physics and classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built the dataset through a collaborative environment that challenges the panel to write and refine problems they would want a research assistant to solve, covering Hartree-Fock, exact diagonalization, quantum/variational Monte Carlo, density matrix renormalization group (DMRG), quantum/classical statistical mechanics, and model building. We evaluate LLMs by programmatically checking solutions against expert-supplied ground truth. To this end we developed machine-grading tools, including symbolic handling of non-commuting operators via normal ordering; these tools also generalize across tasks. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. The best model, GPT5, solves 30% of the problems; the average across 17 models (GPT, Gemini, Claude, DeepSeek, Llama) is 11.4 ± 2.1%. Moreover, 18 problems are solved by none of the 17 models, and 26 by at most one. These unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Model answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark will guide development toward capable AI research assistants and tutors.
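
To illustrate what grading non-commuting operator expressions via normal ordering can look like, here is a minimal Python sketch. It is not the benchmark's grader: the data layout, the function names `normal_order` and `equivalent`, and the restriction to fermionic modes are all assumptions made for this example. Two expressions are reduced to a canonical normal-ordered form using the anticommutation relations and then compared term by term.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical representation (an assumption for this sketch):
# an operator is (mode, dagger), so (0, True) is c_0^dagger and (0, False) is c_0.
# An expression is a dict mapping a tuple of operators (a "word") to a coefficient.

def _key(op):
    mode, dagger = op
    # Creation operators sort before annihilation operators, then by mode index.
    return (0 if dagger else 1, mode)

def normal_order(expr):
    """Rewrite a fermionic expression in a canonical normal-ordered form."""
    result = defaultdict(Fraction)
    stack = [(tuple(word), Fraction(coeff)) for word, coeff in expr.items()]
    while stack:
        word, coeff = stack.pop()
        for i in range(len(word) - 1):
            a, b = word[i], word[i + 1]
            if a == b:
                # c_i c_i = 0 and c_i^dagger c_i^dagger = 0: the term vanishes.
                break
            if _key(a) > _key(b):
                # Swapping adjacent fermionic operators costs a minus sign.
                stack.append((word[:i] + (b, a) + word[i + 2:], -coeff))
                # {c_i, c_i^dagger} = 1 leaves a contracted term behind.
                if a[0] == b[0]:
                    stack.append((word[:i] + word[i + 2:], coeff))
                break
        else:
            # No out-of-order or repeated neighbours: the word is canonical.
            result[word] += coeff
    return {w: c for w, c in result.items() if c != 0}

def equivalent(expr_a, expr_b):
    """Two expressions agree if their normal-ordered forms are identical."""
    return normal_order(expr_a) == normal_order(expr_b)

c0, c0d = (0, False), (0, True)
# c_0 c_0^dagger  versus  1 - c_0^dagger c_0
print(equivalent({(c0, c0d): 1}, {(): 1, (c0d, c0): -1}))  # True
# n_0^2  versus  n_0, with n_0 = c_0^dagger c_0
print(equivalent({(c0d, c0, c0d, c0): 1}, {(c0d, c0): 1}))  # True
```

Comparing canonical forms rather than raw strings means that a model answer written in a different but algebraically equal operator order is still graded as correct, while a sign error or a dropped contraction term is not.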

Paper Structure

This paper contains 23 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Problem type distribution and representative example questions in each type.
  • Figure 2: Example questions in CMT-Benchmark by four answer modalities: numerical value, multiple choice, algebraic expressions, and non-commutative operator expressions.
  • Figure 3: Model performance on CMT-Benchmark. (a) Overall success rate on benchmark by model. (b) Success rate per model divided by problem type.