Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

Haoyu Guo, Maria Tikhanovskaya, Paul Raccuglia, Alexey Vlaskin, Chris Co, Daniel J. Liebling, Scott Ellsworth, Matthew Abraham, Elizabeth Dorfman, N. P. Armitage, Chunhan Feng, Antoine Georges, Olivier Gingras, Dominik Kiese, Steven A. Kivelson, Vadim Oganesyan, B. J. Ramshaw, Subir Sachdev, T. Senthil, J. M. Tranquada, Michael P. Brenner, Subhashini Venugopalan, Eun-Ah Kim

TL;DR

This work constructs an expert-curated database of 1,726 scientific papers, and a set of 67 expert-formulated questions that probe deep understanding of the literature, and evaluates six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text.

Abstract

Large Language Models (LLMs) show great promise as powerful tools for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, the two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well-supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert-formulated questions and the rubric will be valuable for assessing expert-level performance of LLM-based reasoning systems.

Paper Structure

This paper contains 10 sections, 4 figures.

Figures (4)

  • Figure 1: (a) Flow diagram showing the database building process and how the LLMs are evaluated. We curated a literature database based on references of review articles recommended by the expert panel. We also collected questions related to the topic of high-$T_c$ cuprates from the expert panel. The LLMs were prompted to answer these questions and the outputs were graded by the expert panel. (b) Composition of the curated literature database. The database contains 3279 papers, and is classified into theoretical papers (green) and experimental papers (blue and orange). All the theoretical papers and about half of the experimental papers are openly available on arXiv. The other half of the experimental papers (961 papers) were obtained from the publisher. A total of 1726 experimental papers are used in the study. (c) Examples of the question database. (d) The metrics that the expert panel used to evaluate the LLM outputs.
  • Figure 2: (a) Physical concepts involved in the question database and their occurrence counts. Each question can be related to multiple concepts. Abbreviations used: ARPES (angle-resolved photoemission), FSR (Fermi surface reconstruction), STM (scanning tunneling microscope), NMR (nuclear magnetic resonance), MR (magnetoresistance), SC (superconductivity), SQUID (superconducting quantum interference device), ADMR (angle-dependent magnetoresistance), $\chi_m$ (magnetic susceptibility), $\mu$SR (muon spin rotation/relaxation), $\sigma(\omega)$ (optical conductivity), EPI (electron-phonon interaction), SB (symmetry breaking), PD (penetration depth), QC (quantum criticality), DS (diamagnetic susceptibility), $\rho_s$ (superfluid stiffness), QO (quantum oscillation). (b) A prompt that poses one question from the database. (c) An excerpt of the response to the prompt in (b) from System 5 (NotebookLM), which bases its answer on the curated literature database and is instructed to provide multiple perspectives. (d) An excerpt of the response to the prompt in (b) from System 6 (custom), which bases its answer on the curated literature database and is able to provide figure references. The figures are reprinted from https://doi.org/10.1038/s41586-019-0932-x [Ref. BMichon2019] with permission from Springer Nature. The responses in (c,d) are trimmed for presentation, and the full responses are included in the SI. (e) Perspectives that the expert panel expected a response to the question in (b) to address. The underlined perspectives are mentioned in the LLM responses.
  • Figure 3: (a-e) Mean scores and standard errors of the 6 models in 5 aspects: (a) Balanced Perspective; (b) Factually Comprehensive; (c) Succinctness; (d) Supported by Evidence; (e) Relevance of Image. (f) The number of grades that enter into the statistics of the results in (a-e).
  • Figure 4: Two examples showing the expectations of visual reasoning capabilities for future LLMs. (a,c) are example questions from the question database. (b,d) are the expected responses, in which the LLMs are expected to surface relevant images and reason based on the contents of the image. The figure in (b) is reprinted from https://www.science.org/doi/10.1126/science.1066974 [Ref. JEHoffman2002d] with permission from the American Association for the Advancement of Science. The figure in (d) is reprinted from https://link.aps.org/doi/10.1103/PhysRevB.64.224519 [Ref. wang_onset_2001] with permission from the American Physical Society.
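
The Figure 3 caption reports a mean score and a standard error for each system on each rubric aspect, together with the number of grades behind each value. The paper summary does not say how these were computed, so the following is only a minimal sketch of one common way to do it, assuming each grade is a numeric score and the standard error is the sample standard deviation divided by the square root of the number of grades. The function names, the `(system, metric, score)` tuple layout, and the sample values are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of numeric grades.

    Assumption: standard error = sample standard deviation / sqrt(n),
    using the n - 1 (Bessel-corrected) variance. With a single grade the
    standard error is undefined, so NaN is returned for it.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one grade is required")
    mean = sum(scores) / n
    if n < 2:
        return mean, float("nan")
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


def summarize(grades):
    """Group (system, metric, score) tuples and summarize each group.

    Returns a dict mapping (system, metric) to (mean, standard_error, count),
    mirroring the three quantities shown in Figure 3 (a-f).
    """
    buckets = defaultdict(list)
    for system, metric, score in grades:
        buckets[(system, metric)].append(score)
    return {
        key: (*mean_and_standard_error(values), len(values))
        for key, values in buckets.items()
    }


if __name__ == "__main__":
    # Hypothetical grades for illustration only; not data from the study.
    example_grades = [
        ("System 5", "Balanced Perspective", 2.0),
        ("System 5", "Balanced Perspective", 4.0),
        ("System 6", "Balanced Perspective", 3.0),
    ]
    for key, (mean, sem, count) in summarize(example_grades).items():
        print(key, f"mean={mean:.2f}", f"sem={sem:.2f}", f"n={count}")
```

Reporting the count alongside each mean matters here because, as panel (f) indicates, the number of grades can differ between systems and aspects, and a standard error based on few grades is correspondingly less reliable.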