Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Abhinaba Basu, Pavan Chakraborty

Abstract

Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r ≤ 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12×). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 ± 0.004) and transfers across architectures (cross-MLIP AUC-ROC ≈ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening, a 25% improvement in discovery yield.
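As a rough illustration of two quantities named in the abstract, the sketch below computes the recall of a stability filter and a percentile-bootstrap 95% confidence interval around it. It is a minimal sketch on toy data, not the paper's envelope-refinement procedure: the function names `recall` and `bootstrap_ci`, the toy labels, and the choice of the percentile method are all assumptions made here for illustration.

```python
import random


def recall(dft_stable, mlip_stable):
    """Fraction of DFT-stable materials that the MLIP filter also calls stable."""
    hits = [pred for truth, pred in zip(dft_stable, mlip_stable) if truth]
    if not hits:
        raise ValueError("no DFT-stable materials in the sample")
    return sum(hits) / len(hits)


def bootstrap_ci(samples, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `samples` (assumed method)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return low, high


# Toy data (hypothetical): 100 DFT-stable materials, of which the filter
# keeps only 7, plus 50 DFT-unstable materials that recall ignores.
dft_stable = [True] * 100 + [False] * 50
mlip_stable = [True] * 7 + [False] * 93 + [False] * 50

observed = recall(dft_stable, mlip_stable)
hit_flags = [pred for truth, pred in zip(dft_stable, mlip_stable) if truth]
ci_low, ci_high = bootstrap_ci(hit_flags)
print(f"recall = {observed:.2f}, 95% bootstrap CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A recall of 0.07 means 7 of every 100 DFT-stable materials pass the filter, which is the same statement as "misses 93%". The interval width here reflects only the toy sample size of 100.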

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The PCM pipeline. Stage 1: automated adversaries (six strategies including LLMs) propose compositional feature vectors; the MLIP oracle evaluates each against the DFT reference. Stage 2: counterexamples refine the safety envelope with bootstrap CIs. Stage 3: the envelope compiles into Lean 4 proofs with explicit axioms.
  • Figure 2: Cross-MLIP comparison of three architecturally distinct MLIPs on 5,000 WBM-derived structures. a, CHGNet vs MACE max force ($r = 0.13$). b, Pairwise force correlation heatmap: all three pairs near zero (CHGNet--TensorNet $r = 0.10$, CHGNet--MACE $r = 0.13$, TensorNet--MACE $r = -0.01$). c, Failure rates: CHGNet 31.1%, TensorNet 75.7%, MACE 73.2%. d, Architecture-specific blind spots with largely disjoint failure chemistries.
  • Figure 3: Adversary strategy comparison (10 configurations, budget = 200). a, CX rate with Wilson 95% CI: all strategies achieve ${>}85\%$ against the 93.2% base rate. b, Unique materials discovered: algorithmic strategies find 61--138 unique compositions; LLM adversaries find 5--29 but concentrate on functionally important materials. c, Exploration heatmap: LLMs converge on high-$Z$, multi-element regions (top) while baselines spread uniformly (bottom).
  • Figure 4: Systematic failure patterns. a, Anomaly distribution: prediction error vs max force for 5,000 WBM materials (grey) with adversarially discovered failures (red). b, Per-element stability disagreement by periodic table block: f-block elements fail most ($p = 0.042$). c, JARVIS cross-functional validation: 682 CHGNet blind spots confirmed by independent DFT.
  • Extended Data Fig. 1: Dual-adversary Venn diagram. Two adversary strategies discover complementary materials with 14.3% overlap (3 consensus materials). Complementary search strategies exploit different regions of compositional space.
  • ...and 8 more figures
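
The Figure 3a caption reports counterexample (CX) rates with Wilson 95% confidence intervals. For readers unfamiliar with that interval, here is a minimal sketch of the standard Wilson score formula for a binomial proportion. The function name `wilson_interval` and the example count of 180 counterexamples out of a budget of 200 are hypothetical; only the budget of 200 comes from the caption.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # The interval is centred on a value shrunk toward 0.5, not on p_hat itself.
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 180 counterexamples found within a budget of 200 queries.
cx_low, cx_high = wilson_interval(180, 200)
print(f"CX rate 0.90, Wilson 95% CI = [{cx_low:.3f}, {cx_high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when the rate is close to 0 or 1, which matters for rates above 85%.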