Quantum Many-Body Physics Calculations with Large Language Models

Haining Pan; Nayantara Mudur; Will Taranto; Maria Tikhanovskaya; Subhashini Venugopalan; Yasaman Bahri; Michael P. Brenner; Eun-Ah Kim

TL;DR

The authors show that, guided by carefully designed prompts, LLMs can achieve high accuracy in carrying out analytical calculations in theoretical physics, specifically the derivation of Hartree-Fock equations: GPT-4 attains an average score of 87.5 (out of 100) across calculation steps taken from recent research papers.

Abstract

Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that, with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics, the Hartree-Fock method, which requires an analytic multi-step calculation to derive an approximate Hamiltonian and the corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it correctly derives the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in the remaining 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases. This strong performance is a first step toward developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
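
The idea of a prompt template with placeholders for problem-specific information can be illustrated with a minimal sketch. The template wording, the placeholder names (`system`, `dof`, `symbol`), the example values, and the helper `fill_template` are all hypothetical and chosen for illustration; they are not the templates used in the paper.

```python
from string import Template
from typing import Mapping

# Hypothetical template text (assumption): the paper's actual templates
# are not reproduced here.
TEMPLATE_TEXT = (
    "You will be instructed to construct the kinetic term of the Hamiltonian "
    "for $system. The degrees of freedom are $dof. "
    "Use $symbol to denote the kinetic Hamiltonian."
)


def fill_template(template_text: str, info: Mapping[str, str]) -> str:
    """Substitute problem-specific information into a prompt template.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete prompt is never produced silently.
    """
    return Template(template_text).substitute(info)


# Illustrative placeholder values (assumption), e.g. supplied by a human reader.
example_info = {
    "system": "a two-dimensional lattice model",
    "dof": "spin and valley",
    "symbol": "H_kin",
}

prompt = fill_template(TEMPLATE_TEXT, example_info)
print(prompt)
```

Failing loudly on a missing placeholder is a design choice of this sketch: it keeps a half-filled template from being passed on as if it were a complete prompt.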

Figures (4)

Figure 1: (a) The five conceptual steps of the derivation of the HF Hamiltonian and self-consistency equations, and the bite-sized tasks within each step. The HF template consists of the prompt template $T_i$ for each task $i$. (b) An example template, $T_3$. The placeholders are highlighted. We turn the template into a prompt for task $3$ by filling in the placeholders for the given paper in the database. (c) The schematic for generating the prompts from the template with placeholders (empty boxes) using human-supplied information (boxes with dots). (d) The schematic for generating the prompts from an abstract. We give an abstract to an LLM and query the LLM to infer system-specific information from the abstract and fill the relevant placeholders in the template. Since notation is not specified in the abstract, we supply the placeholders corresponding to the notation. The combination is a complete prompt. (e) An example of a query asking an LLM to infer system-specific information. (f) An example response from GPT-4 to the query of panel (e). We required the response to consist of the quote, an explanation, and the answer. The answers are highlighted. (g) An example of a response by GPT-4 to the final prompt for Ref. pan2022topological, corresponding to $T_{11}$.
Figure 2: (a) The execution workflow using the full prompt set based on the HF template. Each prompt builds on the outputs of all the previous steps. Specifically, the prompt $P_i$ for task $i$ incorporates the corrected output $O_{i-1}^*$ of the previous prompt. (b) The schematic of evaluation and correction for each task $i$. Each output $O_i$ to the prompt $P_i$ executing task $i$ is evaluated by the human evaluator and corrected, if necessary. The verified output $O_i^*$ is incorporated into the next prompt $P_{i+1}$. (c) An example of the prompt $P_5$ for reproducing the calculations in Ref. pan2022topological. (d) An example of the execution outcome $O_5$. This output is correct, hence correction was not necessary and $O_5=O_5^*$.
Figure 3: (a) The extraction prompt. This prompt is supplied to GPT-4 together with an excerpt $E_i$ and the HF template $T_i$. The prompt instructs the LLM to locate the placeholders in the template and replace the dummy labels with information it extracts from the excerpt. The output is an execution prompt $P_i$. (b) Schematic of excerpt-based information extraction using the prompt in (a). (c-e) Mean and standard error of the mean of the score for placeholder completion for a subset of placeholders, organized by the type of information associated with the placeholders: information specific to the nature of the system (c); notation explicitly present in the excerpt (d); notation that needs to be inferred (e). (f) Comparison of the extraction in zero-shot and one-shot scenarios, using the mean performance over five papers to define the bar's length and the standard error of the mean for the error bar.
Figure 4: (a) The four-layered rubric system for evaluating an LLM's output $O_i$ in response to each prompt $P_i$. Adherence: how closely the LLM adheres to the instructions. Rigor: how accurate the mathematical derivation is. Knowledge: how consistent the LLM's reasoning is with the laws of physics. Correctness: how correct the LLM's response is. (b) The rubric-dependence of the performance: the average score, with its standard deviation, for each rubric layer across all outputs for all papers. (c) The task-dependence of the performance. We averaged the score for each prompt across the four rubric layers. These average scores were then averaged over the prompts belonging to each step of the derivation as broken down in Fig. 1(a). (d) The paper-dependence of the performance on information extraction and execution. The average score across all the placeholders for a given paper over the excerpt-based information extraction detailed in Fig. 3 is shown in the lighter sage green. The average score across all rubric layers and prompts for deriving the $H_{HF}$ for a given paper is shown in darker olive green. For both extraction and execution, the error bars were calculated by averaging over all papers and placeholders/tasks. The dashed line between arXiv:2108.02159 and arXiv:2110.11330 marks the separation between papers before and after the training data cutoff date. (e) The dependence of the execution score, for each rubric layer, on the degree of overlap between the correct output $O_i^*$ and the text of the target research paper.
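
The execute, evaluate, and correct loop described in the Figure 2 caption, together with the per-layer averaging of rubric scores described in the Figure 4 caption, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `run_chain`, `average_scores`, and the stand-in `fake_execute` and `fake_review` callables are hypothetical names, each prompt is given only the verified output of the immediately preceding task, and the numeric scores are made up.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence

# The four rubric layers named in the Figure 4 caption.
RUBRIC_LAYERS = ("adherence", "rigor", "knowledge", "correctness")


def run_chain(
    prompts: Sequence[str],
    execute: Callable[[str], str],
    review: Callable[[str], Optional[str]],
) -> List[str]:
    """Run tasks in order, feeding each verified output into the next prompt.

    `execute` stands in for the LLM call; `review` stands in for the human
    evaluator and returns a corrected output, or None if the output is accepted.
    """
    verified: List[str] = []
    previous = ""  # verified output O_{i-1}^* of the preceding task
    for prompt in prompts:
        full_prompt = prompt if not previous else f"{previous}\n\n{prompt}"
        output = execute(full_prompt)  # raw output O_i
        correction = review(output)
        previous = output if correction is None else correction  # O_i^*
        verified.append(previous)
    return verified


def average_scores(scores: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average the score of each rubric layer over a collection of outputs."""
    return {layer: mean(s[layer] for s in scores) for layer in RUBRIC_LAYERS}


# Stand-ins (assumptions) so that the sketch runs without a model or a human.
def fake_execute(prompt: str) -> str:
    return f"answer to: {prompt.splitlines()[-1]}"


def fake_review(output: str) -> Optional[str]:
    return "corrected answer" if "task 2" in output else None


outputs = run_chain(["task 1", "task 2", "task 3"], fake_execute, fake_review)

# Made-up scores for two outputs, for illustration only.
layer_means = average_scores([
    {"adherence": 100, "rigor": 50, "knowledge": 100, "correctness": 50},
    {"adherence": 100, "rigor": 100, "knowledge": 50, "correctness": 100},
])
```

Passing the model call and the reviewer in as callables keeps the loop itself independent of any particular LLM API or scoring procedure.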