From Phase Prediction to Phase Design: A ReAct Agent Framework for High-Entropy Alloy Discovery

Iman Peivaste, Salim Belouettar

TL;DR

This work establishes LLM-guided agentic reasoning as a principled, transparent, and manifold-aware complement to gradient-free optimisation for inverse alloy design.

Abstract

Discovering high-entropy alloy (HEA) compositions that reliably form a target crystal phase is a high-dimensional inverse design problem that conventional trial-and-error experimentation and forward-only machine learning models cannot efficiently solve. Here we present a ReAct (Reasoning + Acting) LLM agent that autonomously proposes, validates, and iteratively refines HEA compositions by querying a calibrated XGBoost surrogate trained on 4,753 experimental records across four phases (FCC, BCC, BCC+FCC, BCC+IM), achieving 94.66\% accuracy (macro F1 = 0.896). Against Bayesian optimisation (BO) and random search baselines, the full-prompt agent achieves descriptor-space rediscovery rates of 38\%, 18\%, and 38\% for FCC, BCC, and BCC+FCC (Mann--Whitney $p \leq 0.039$), with proposals lying 2.4--22.8$\times$ closer to the experimental phase manifold than random search. An ablation reveals that domain priors shift the agent from landmark-alloy recall toward compositionally diverse exploration -- an uninformed agent scores higher rediscovery by concentrating on literature-dense families, while the full-prompt agent explores underrepresented space (unique ratio 1.0 vs.\ 0.39 for BCC+FCC). These regimes represent distinct criteria: proximity to known literature versus genuine discovery. Spearman analysis confirms that the agent's reasoning is statistically aligned with empirical phase distributions ($\rho = 0.736$, $p = 0.004$ for BCC). This work establishes LLM-guided agentic reasoning as a principled, transparent, and manifold-aware complement to gradient-free optimisation for inverse alloy design.
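The two evaluation quantities named in the abstract, descriptor-space rediscovery rate and unique ratio, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the Euclidean metric, the distance `threshold`, the rounding used to decide when two compositions count as identical, and all function names are assumptions introduced here.

```python
import math


def rediscovery_rate(proposals, known, threshold):
    """Fraction of proposed descriptor vectors that lie within `threshold`
    of at least one known experimental vector.

    Assumptions: vectors are already scaled, and Euclidean distance is used.
    """
    if not proposals:
        return 0.0
    hits = 0
    for p in proposals:
        nearest = min(math.dist(p, k) for k in known)
        if nearest <= threshold:
            hits += 1
    return hits / len(proposals)


def unique_ratio(compositions, ndigits=2):
    """Number of distinct compositions divided by the number of proposals.

    Two compositions count as the same when their element fractions agree
    after rounding to `ndigits` decimals (an assumed tolerance).
    """
    if not compositions:
        return 0.0
    keys = set()
    for comp in compositions:
        key = tuple(sorted((el, round(frac, ndigits))
                           for el, frac in comp.items() if frac > 0))
        keys.add(key)
    return len(keys) / len(compositions)


# Toy data in a 2-D descriptor space (the paper uses 13 descriptors).
known = [[0.0, 0.0], [1.0, 1.0]]
proposals = [[0.1, 0.0], [5.0, 5.0]]
rate = rediscovery_rate(proposals, known, threshold=0.5)

ratio = unique_ratio([
    {"Co": 0.5, "Ni": 0.5},
    {"Ni": 0.5, "Co": 0.5},   # same alloy, different key order
    {"Fe": 1.0},
])
```

In this toy case only the first proposal falls within the threshold of a known point, and two of the three compositions are distinct. A low unique ratio therefore signals repeated proposals, which is the behaviour the ablation attributes to the uninformed agent.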

Paper Structure (28 sections, 6 figures, 5 tables)


Figures (6)

  • Figure 1: The forward prediction pipeline (top) maps a proposed HEA composition to class probabilities via 13 mixture-rule descriptors and a calibrated XGBoost surrogate trained on 5,677 cleaned experimental records (4,753 retained after restricting to the four target classes; see Section 2.1). The inverse design loop (bottom, red arrows) shows the ReAct agent iteratively proposing compositions, validating them, querying the surrogate, and optionally delegating to the Bayesian optimisation module when reasoning stalls.
  • Figure 2: (a) Normalised confusion matrix on the held-out test set (n = 476 records). Diagonal values are per-class recall; overall accuracy is 94.66%. (b) Isotonic-regression calibration reliability diagram for BCC, FCC, and BCC+FCC. BCC+IM is omitted due to insufficient test samples per probability bin (n = 145 total). The BCC+FCC curve shows moderate miscalibration in the 0.3--0.6 probability range, attributable to sparse test samples in that interval; the agent's decisions are gated at $P > 0.80$, where all classes show reliable calibration.
  • Figure 3: Rediscovery rate by method and target phase (mean $\pm$ std, $n = 10$ runs). Neither BO nor random search achieves meaningful rediscovery for any phase. Significance markers indicate a one-sided Mann--Whitney $U$ test (agent $>$ baseline): $^{**} p < 0.01$ (FCC: $p < 0.001$, BCC+FCC: $p = 0.003$); $^{*} p < 0.05$ (BCC: $p = 0.039$).
  • Figure 4: Principal component analysis (PCA) was fit on the scaled 13-descriptor training set, and both held-out test compositions (small, transparent markers) and agent-proposed compositions from predict phase calls (larger markers) were projected into the same 2D space. Colours indicate predicted phase class (FCC, BCC, BCC+FCC, BCC+IM). The visualisation highlights overlap and separation trends between known compositions and generated proposals across phase regions. Note that proximity in this 2D PCA projection does not directly correspond to rediscovery distances, which are computed in the full 13-dimensional scaled descriptor space.
  • Figure 5: Convergence of $P(\text{target phase})$ per surrogate call (mean $\pm$ std, $n = 10$ runs). (a) FCC: the agent achieves mean $P > 0.97$ from the first call, reflecting accurate domain priors. (b) BCC: the agent rises from $\sim 0.60$ to $\sim 0.88$ over 20 calls through iterative surrogate feedback. (c) BCC+FCC: the random baseline frequently fails to locate the target stability region (mean best $P = 0.591 \pm 0.253$), while the agent and BO reliably do so. BO uses random initialisation for the convergence comparison (see Section 2.5).
  • ...and 1 more figures
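
Figure 1 refers to mixture-rule descriptors computed from a proposed composition. The 13 descriptors are not listed in this excerpt, so the sketch below shows two quantities commonly used for HEAs as assumed examples only: the ideal mixing entropy $\Delta S_{\text{mix}} = -R \sum_i c_i \ln c_i$ and the composition-weighted valence electron concentration (VEC). The function names and the small `VEC` table are illustrative and are not taken from the paper.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)

# Valence electron counts for a few 3d transition metals (illustrative subset).
VEC = {"Cr": 6, "Mn": 7, "Fe": 8, "Co": 9, "Ni": 10}


def normalise(comp):
    """Scale element amounts so the atomic fractions sum to 1."""
    total = sum(comp.values())
    if total <= 0:
        raise ValueError("composition must contain a positive amount")
    return {el: x / total for el, x in comp.items() if x > 0}


def mixing_entropy(comp):
    """Ideal configurational entropy of mixing, -R * sum(c_i * ln c_i)."""
    c = normalise(comp)
    return -R * sum(x * math.log(x) for x in c.values())


def mean_vec(comp):
    """Rule-of-mixtures average: sum(c_i * VEC_i)."""
    c = normalise(comp)
    return sum(x * VEC[el] for el, x in c.items())


# Equimolar CoCrFeMnNi as a familiar test composition.
cantor = {el: 1.0 for el in ("Co", "Cr", "Fe", "Mn", "Ni")}
s_mix = mixing_entropy(cantor)  # equals R * ln(5) for an equimolar quinary
vec = mean_vec(cantor)          # (9 + 6 + 8 + 7 + 10) / 5
```

Each descriptor here is a composition-weighted function of per-element properties, which is what makes the feature vector cheap to recompute every time the agent proposes a new composition.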