Table of Contents

Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

Ken Deng, Xiangfei Wang, Guijing Duan, Chen Mo, Junkun Huang, Runqing Zhang, Ling Qian, Zhiguo Huang, Jize Han, Di Luo

Abstract

Recent advances in automated scientific discovery have shown remarkable promise across frontier research domains, with agent systems driven by large language models (LLMs) emerging as powerful tools for physics research. However, in practical applications, LLM scientific research is prone to hallucinations, highlighting the need for reliable verification and error-correction mechanisms. Here we introduce PhysVEC, an automated multi-agent framework for verifiable and error-correcting AI-driven physics research. PhysVEC incorporates a programming verifier and a scientific verifier to ensure both coding correctness and physical validity, and provides human-auditable evidence at each stage. We curate QMB100, an end-to-end research-level benchmark dataset consisting of $100$ tasks extracted from $21$ high-impact articles focused on quantum many-body physics. We evaluate PhysVEC with four frontier LLMs and find that it significantly outperforms baselines in both programming tests and scientific tests across all LLMs and task categories. PhysVEC demonstrates effective inference-time scaling and delivers accurate physical predictions through integrated verification and error-correction mechanisms, paving the way for reliable and interpretable AI physicists.


Paper Structure

This paper contains 19 sections and 8 figures.

Figures (8)

  • Figure 1: PhysVEC framework and QMB100 dataset. (a) PhysVEC: a multi-agent AI physicist for quantum many-body simulations with self-verification and automated error-correction design. (b) QMB100: a benchmark dataset comprising $100$ figures drawn from $21$ high-impact articles, covering numerical studies that can be implemented with ITensors, NetKet, Qiskit, and ORCA.
  • Figure 2: Verification and error correction via unit tests and integration tests. In the generated script, the Author agent defines a set of element functions that are subsequently called to perform the computations (gray block). The Programming verifier then conducts unit tests (blue block) and integration tests (green block) for all element functions, aggregates the reports, and implements corrections accordingly (red blocks).
  • Figure 3: Results of the programming test. (a) Comparison of the executability of our framework (PhysVEC) with three other baselines (PhysVEC-1-shot, ReAct-RAG, and ReAct). (b), (c) The accuracy in unit tests and integration tests before (hollow markers, PhysVEC-1-shot) and after (solid markers, PhysVEC) the iterative verification and error correction. Integration tests are not applicable to DFT input files.
  • Figure 4: The efficiency of token consumption and tool usage. (a) The $S/T_i$ (input token efficiency), $S/T_o$ (output token efficiency), and $S/T_c$ (tool usage efficiency) of LLMs across different topics. The horizontal dashed line indicates the average performance of the $4$ models within each topic. (b) The marginal utility of tool usage (mainly retrieval from official repositories or manuals) for PhysVEC (ours) versus ReAct-RAG (baseline). The dashed line represents the marginal benefit of using the retrieval tools in ReAct-RAG; solid circles above the dashed line indicate higher tool-use efficiency of PhysVEC. (c) Inference-time scaling via repeated trials on $8$ failed tasks in nqs for Gemini 2.5 Flash. The pass rate, defined as the fraction of tasks that pass at least once within the first $N$ independent trials, increases with repeated trials.
  • Figure 5: Performance in the scientific test on the QMB100 subset. (a) Comparison of the performance of PhysVEC and baselines on the QMB100 subset. The vertical axis (completed tasks) denotes the number of tasks completed successfully after the scientific test. (b) Cumulative number of completed tasks as a function of iteration in the scientific test under the PhysVEC framework.
  • ...and 3 more figures
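The pass-rate metric described in Figure 4(c) is the standard pass@$N$ quantity: the fraction of tasks that succeed at least once within the first $N$ independent trials. A minimal sketch of how it could be computed (variable names are illustrative, not taken from the paper's code):

```python
from typing import List

def pass_rate(trials: List[List[bool]], n: int) -> float:
    """Fraction of tasks that pass at least once within the first n trials.

    trials[t] holds the per-trial outcomes (True = pass) for task t,
    in the order the independent trials were run.
    """
    passed = sum(any(task[:n]) for task in trials)
    return passed / len(trials)

# Hypothetical outcomes for 3 tasks, 4 independent trials each.
outcomes = [
    [False, True, False, False],   # first passes on trial 2
    [False, False, False, False],  # never passes
    [True, True, False, True],     # first passes on trial 1
]
print(pass_rate(outcomes, 1))  # 1 of 3 tasks passes within 1 trial
print(pass_rate(outcomes, 4))  # 2 of 3 tasks pass within 4 trials
```

Because `pass_rate` counts a task as soon as any of its first $N$ trials succeeds, it is non-decreasing in $N$, which is why the curve in Figure 4(c) can only rise with repeated trials.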