From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Haonan Huang

Abstract

While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge -- learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform that closes this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and, in dedicated reflection sessions, correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and cuts the deviation from literature values from 47% to 3%; when transferred to an unfamiliar material, it achieves 1% deviation with zero pipeline failures.
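The record, retrieve, and reflect loop described in the abstract can be illustrated with a minimal sketch. This is not QMatSuite's actual API or schema: the class names, method names, fields, and tag-based retrieval below are all assumptions made for illustration. It only shows how findings might carry provenance, how a review step might deprecate an erroneous finding without erasing it, and how a pattern might link back to its source findings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Insight:
    """One unit of recorded knowledge (hypothetical schema, not QMatSuite's)."""
    id: int
    text: str
    grade: str                 # "finding" or "pattern"
    tags: frozenset            # keywords used for retrieval
    provenance: tuple          # ids of source calculations or source insights
    deprecated_by: Optional[int] = None


class KnowledgeStore:
    """In-memory store; a real system would persist this (assumption)."""

    def __init__(self):
        self._items: List[Insight] = []

    def _add(self, text, grade, tags, provenance):
        insight = Insight(len(self._items), text, grade,
                          frozenset(tags), tuple(provenance))
        self._items.append(insight)
        return insight

    def record_finding(self, text, tags, calculation_ids):
        # Execution sessions record findings tied to the calculations behind them.
        return self._add(text, "finding", tags, calculation_ids)

    def retrieve(self, tags):
        # Called before a new calculation: return live insights sharing any tag.
        wanted = set(tags)
        return [i for i in self._items
                if i.deprecated_by is None and i.tags & wanted]

    def deprecate(self, insight_id, replacement_text, calculation_ids):
        # Review sessions correct a finding by superseding it, keeping the old
        # record so the correction itself stays traceable.
        old = self._items[insight_id]
        new = self._add(replacement_text, old.grade, old.tags, calculation_ids)
        old.deprecated_by = new.id
        return new

    def synthesize_pattern(self, text, source_ids):
        # Review sessions distill several findings into one pattern whose
        # provenance is the list of source insight ids.
        sources = [self._items[i] for i in source_ids]
        tags = frozenset().union(*(s.tags for s in sources))
        return self._add(text, "pattern", tags, source_ids)
```

A short usage example, with placeholder texts and calculation ids:

```python
store = KnowledgeStore()
first = store.record_finding("placeholder finding", {"Fe", "AHC"}, ["calc-1"])
fixed = store.deprecate(first.id, "placeholder corrected finding", ["calc-2"])
pattern = store.synthesize_pattern("placeholder pattern", [fixed.id])
live = store.retrieve({"AHC"})  # the corrected finding and the pattern
```

Superseding rather than deleting is one possible design choice here; it keeps the deprecated advice available for later audits.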

Paper Structure (2 sections, 6 figures, 3 tables)

Scale validation.
Cross-engine and agent validation.

Figures (6)

Figure 1: Platform architecture and scale validation. a, QMatSuite architecture showing symmetric access by AI agents (via MCP) and human researchers (via GUI) to the shared core, which contains the knowledge system, provenance tracking, present-tense shared state, and engine dispatch layer. b, Computed versus experimental lattice constants for 114 materials (MAE = 1.02%). c, Computed versus experimental band gaps for 68 non-metallic compounds (MAE = 1.76 eV); 42/42 metals correctly identified; 5 correlated insulators predicted metallic (known PBE limitation, marked $\times$). Inset: 0--2 eV region.
Figure 2: Knowledge transforms a complex workflow. a, Fe AHC learning curve across three runs with 0, 6, and 9 accumulated insights. Gray bars: run_calculation calls; blue line: API reasoning time; red line: AHC error versus literature. b, Episode avoidance matrix showing which pitfalls each run encountered (red) versus avoided via knowledge (green). Yellow: insight in database but not retrieved; gray: not applicable. c, Tool call composition shifting from infrastructure debugging ("Debugger") to physics exploration ("Optimizer").
Figure 3: Knowledge self-corrects and transfers across materials. a, Left: convergence analysis showing the deprecated parameter recommendation (dis_froz_max = 17 eV, red $\times$) as an unconverged outlier, with the upward arrow indicating that Berry mesh refinement coincidentally pulled the AHC toward the literature value (error cancellation). Right: the review session's deprecation reasoning and corrected replacement. b, Ni AHC three-way comparison: 0 insights (14 calculations, 7 failures), 15 unreviewed insights (9 calculations, 5 failures), 21 reviewed insights (3 calculations, 0 failures). c, Ni pitfall avoidance matrix showing progressive transfer; yellow star ($\bigstar$6) marks the dis_froz_max row where unreviewed knowledge provided incorrect advice requiring 3 extra iterations.
Figure 4: Knowledge consolidates through dedicated reflection. a, Three patterns distilled from 25 findings by a dedicated reflection session (Pattern 1: lattice overestimation scaling; Pattern 2: band gap underestimation; Pattern 3: Pulay stress trap), with provenance chains linking each pattern to its source calculations. b, Nudge compliance across 398 execution sessions: 84.9% received recording nudges, 4.0% showed any mid-session recording, 0% produced patterns during execution. All 41 patterns were produced exclusively in dedicated review sessions. c, Knowledge grade evolution across experiments: findings accumulate during execution (25$\to$43), patterns appear only after dedicated reflection (0$\to$3), principles remain an open frontier.
Figure 5: Extended Data Fig. 1 $|$ Temporal structure of agent activity across three Fe AHC runs. Gantt-style visualization of tool call composition, showing the shift from infrastructure debugging to physics exploration as knowledge accumulates. Condition A (0 insights): dominated by debugging and error recovery, with 3.5 hours spent diagnosing why the computed AHC was zero (the starting_magnetization issue). The agent timed out at 6.2 hours. Run 02 (6 insights): proactively avoided the starting_mag pitfall but encountered a novel k-point convention mismatch requiring 8 attempts. Run 03 (9 insights): resolved all setup issues within the first 80 tool calls, then devoted the remainder to a voluntary 7-iteration convergence study exploring disentanglement parameters and adaptive mesh refinement, a qualitative shift from "getting it to work" to "understanding the physics." Results shown at right demonstrate monotonic accuracy improvement.
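The captions above report two kinds of mean absolute error (MAE): a relative one in percent for lattice constants and an absolute one in eV for band gaps. The sketch below shows one common way to compute each. It assumes the percentage figure is a mean absolute percentage error relative to the experimental value, which the captions do not state explicitly; the function names are illustrative.

```python
def mean_absolute_error(computed, reference):
    """MAE in the units of the inputs (e.g. eV for band gaps)."""
    if len(computed) != len(reference) or not computed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(c - r) for c, r in zip(computed, reference)) / len(computed)


def mean_absolute_percent_error(computed, reference):
    """MAE relative to the reference values, in percent.

    Assumes every reference value is non-zero (true for lattice constants).
    """
    if len(computed) != len(reference) or not computed:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(
        abs(c - r) / abs(r) for c, r in zip(computed, reference)
    ) / len(computed)
```

The relative form suits lattice constants, which span a range of magnitudes, while the absolute form keeps band-gap errors in physical units.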