AI for Science: March 2026 Week 12 (Mar 16–22)

Mar 16 – Mar 22, 2026 · 87 papers analyzed · 4 breakthroughs

Summary

Queries across AI4Math and AI4Physics returned ~210 papers this week, of which 87 were analyzed in depth. 4 breakthroughs identified: (1) 2603.16770 introduces Garnet, the first force field trained entirely from scratch for proteins AND small molecules from quantum-mechanical data, outperforming OpenFF/Espaloma across SPICE benchmarks; (2) 2603.14775 extends neural network backflow to ab-initio periodic solids, achieving coupled-cluster accuracy on crystalline systems; (3) 2603.19514 shows LLMs fine-tuned to generate formal counterexamples outperform GPT-o1/Gemini on all Pass@k metrics, addressing the long-neglected disproving half of mathematical reasoning; (4) 2603.15617 introduces HorizonMath, the first benchmark of genuinely unsolved research-level math problems with automatic verification, enabling honest measurement of AI mathematical discovery. Notable: 2603.15712 (LLM+RAG discovers high-entropy OER catalysts with a 362 mV overpotential), 2603.19329 (hierarchical proof search achieves state-of-the-art code verification in Lean), 2603.17216 (synthetic task scaling for AI scientist training), 2603.19782 (embodied agentic AI closes the experimental discovery loop).
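Pass@k, the metric cited for 2603.19514, is usually computed with the standard unbiased estimator from the code-generation evaluation literature; whether the paper uses exactly this definition is an assumption, but a minimal sketch looks like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (requires k <= n): probability that at
    least one of k samples drawn without replacement from n generations
    (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 generations of which 2 are correct, Pass@1 is 0.5 and Pass@3 is 1.0.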

Key Takeaway

The AI4Science frontier this week is defined by two convergences: neural methods reaching QM accuracy for materials simulation (Garnet, neural backflow), and mathematical AI finally grappling with the neglected disproving problem — together signaling that AI is beginning to contribute to science rather than merely accelerate it.

Breakthroughs (4)

1. Training a force field for proteins and small molecules from scratch

Why Novel: All prior general-purpose force fields rely on hand-crafted functional forms and empirical parameter fitting. Garnet replaces the entire parameterization pipeline with neural network learning directly from quantum chemistry, covering both macromolecular and drug-like chemical space in a single model.

Impact: If robust in production MD, Garnet could retire classical force field fitting workflows and unify small-molecule and protein simulation under a single trainable model.
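Garnet's actual architecture is not reproduced here, but the core idea of fitting an energy surrogate directly to quantum-chemical reference data, with forces then available as gradients of the learned energy, can be sketched with a toy pair-potential model. Everything below (the Gaussian basis, box size, `descriptor` helper) is illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.linspace(0.8, 3.0, 8)  # radial-basis centers (arbitrary length units)
width = 0.4

def pair_phi(r):
    """Gaussian radial-basis features for one pair distance r."""
    return np.exp(-((r - centers) ** 2) / width ** 2)

def descriptor(pos):
    """Sum pair features over all atom pairs, so the energy is linear in weights w."""
    n = len(pos)
    return sum(pair_phi(np.linalg.norm(pos[i] - pos[j]))
               for i in range(n) for j in range(i + 1, n))

def forces(pos, w, eps=1e-5):
    """F = -dE/dpos by central finite differences (a sketch; real models use autograd)."""
    F = np.zeros_like(pos)
    for idx in np.ndindex(*pos.shape):
        p = pos.copy(); p[idx] += eps; e_plus = w @ descriptor(p)
        p[idx] -= 2 * eps; e_minus = w @ descriptor(p)
        F[idx] = -(e_plus - e_minus) / (2 * eps)
    return F

# Synthetic "quantum-chemical" reference energies from a hidden parameter vector.
w_true = rng.normal(size=centers.size)
configs = [rng.uniform(0.0, 2.5, size=(4, 3)) for _ in range(200)]
X = np.stack([descriptor(p) for p in configs])
E_ref = X @ w_true

# "Training from scratch": fit the surrogate to the reference energies.
w_fit, *_ = np.linalg.lstsq(X, E_ref, rcond=None)
```

The toy model is linear in its weights so a least-squares fit suffices; a real neural force field replaces `descriptor` with a learned equivariant network and trains on energies and forces jointly.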

2. Neural network backflow for ab-initio solid calculations

Why Novel: Prior neural-network quantum Monte Carlo methods (FermiNet, PauliNet) targeted molecules. Extending to periodic systems requires fundamentally different architecture choices — this paper solves the translational symmetry and k-point challenges, enabling ab-initio accuracy for bulk materials.

Impact: Opens a path to accurate first-principles simulation of correlated materials (high-Tc superconductors, catalytic surfaces) where DFT fails and coupled-cluster methods are computationally intractable.
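The paper's wavefunction architecture is not reproduced here, but one ingredient any periodic method needs is translation-invariant geometry. A minimal sketch of minimum-image pair distances in an orthorhombic cell, with the caveat that these function names are illustrative and real k-point handling involves twisted boundary conditions well beyond this:

```python
import numpy as np

def minimum_image_disp(xi, xj, box):
    """Displacement xi - xj wrapped to the nearest periodic image
    (orthorhombic cell only; box holds the three edge lengths)."""
    d = xi - xj
    return d - box * np.round(d / box)

def periodic_distance(xi, xj, box):
    """Translation-invariant pair distance under periodic boundary conditions."""
    return np.linalg.norm(minimum_image_disp(xi, xj, box))
```

Because the displacement is wrapped, translating either atom by a full lattice vector leaves the distance unchanged, which is exactly the symmetry a periodic ansatz must respect.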

3. Learning to Disprove: Formal Counterexample Generation with Large Language Models

Why Novel: AI math systems are almost exclusively trained to prove. This paper identifies counterexample generation as a distinct skill, builds a mutation-based dataset of falsifiable conjectures, and demonstrates that proving ability does not transfer to disproving — a gap unaddressed in prior work.

Impact: Reframes mathematical AI as requiring bilateral reasoning (prove AND disprove); provides dataset and training recipe for the neglected disproving direction.
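The paper's dataset and models are not available here, but the mutate-then-falsify idea can be illustrated with a toy example (the specific inequality and the grid search are my stand-ins, not the paper's pipeline): flip the direction of the AM-GM inequality to produce a falsifiable conjecture, then search for an input that violates it.

```python
import itertools

def am_gm_mutated(a: float, b: float) -> bool:
    """A 'mutated' conjecture: (a + b) / 2 <= sqrt(a * b) for a, b > 0.
    The true AM-GM inequality has >=; flipping its direction makes it false."""
    return (a + b) / 2 <= (a * b) ** 0.5

def find_counterexample(claim, grid):
    """Brute-force falsification: return the first input pair violating `claim`."""
    for a, b in itertools.product(grid, repeat=2):
        if not claim(a, b):
            return (a, b)
    return None

cex = find_counterexample(am_gm_mutated, [0.5, 1.0, 2.0, 4.0])
```

Note that proving AM-GM gives no procedure for finding `cex`; producing a concrete violating instance is a different skill, which is the asymmetry the paper targets.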

4. HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Why Novel: Existing benchmarks (AIME, FrontierMath) either use competition problems that may be memorized or lack automatic verification. HorizonMath uses open problems whose answers are formally verifiable but not yet known, making it leakage-proof by construction and able to track genuine progress over time.

Impact: Provides the field with a durable, non-saturating benchmark for genuine mathematical discovery — not just performance on known problems.
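HorizonMath's defining property, answers that are unknown in advance but mechanically checkable, has the shape of any verify-easier-than-solve problem. A toy illustration (the number and the factoring task are my stand-ins, not benchmark content):

```python
def verify(candidate: int, n: int = 8051) -> bool:
    """Accept `candidate` iff it is a nontrivial factor of n.
    Checking is trivial even when finding a factor is hard, mirroring
    benchmarks whose answers are unknown but automatically verifiable."""
    return 1 < candidate < n and n % candidate == 0
```

A grader needs only `verify`; it never needs to know the answer ahead of time, which is what makes such a benchmark leakage-proof.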

Trends

  • Force field and wavefunction methods are rapidly converging on QM accuracy via neural networks — both molecular dynamics (Garnet) and wavefunction-level (neural backflow for solids) saw major advances this week.

  • Mathematical reasoning is bifurcating: proving and disproving are recognized as distinct capabilities requiring separate training, moving the field beyond monolithic reasoning models.

  • Benchmark inflation pressure is generating a corrective response — HorizonMath represents a push toward non-saturating, leakage-proof evaluation grounded in genuinely open problems.

  • Agentic science is maturing: multiple papers this week move beyond single-step AI assistance toward closed-loop experimental cycles (self-driving microscopy, embodied science, synthetic task scaling).

  • LLM-driven materials discovery with physics validation (DFT, MD) is emerging as a credible workflow, with RAG providing the domain knowledge bridge between generative exploration and physical grounding.

Notable Papers (6)

1. LLM-Driven Discovery of High-Entropy Catalysts via Retrieval-Augmented Generation

A RAG-augmented LLM pipeline discovers high-entropy-alloy OER catalysts achieving a 362 mV overpotential (vs. 320–370 mV for precious-metal catalysts) with 82.4% structural stability, validated by DFT and a full ablation study.

2. Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Hierarchical decomposition + LLM completion reaches state-of-the-art on Verina, Clever, and AlgoVeri Lean verification benchmarks, generating proofs averaging 154 lines.

3. AI Scientist via Synthetic Task Scaling

Synthetic task scaling (generating 256 training trajectories per research task) significantly improves AI scientist agents' ability to run complete research cycles without human intervention.

4. Embodied Science: Closing the Discovery Loop with Agentic Embodied AI

Proposes an architectural framework where embodied AI agents close the full scientific loop — hypothesis, experiment, observation, revision — in physical lab settings.

5. Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Neuro-symbolic approach combining LLM tactic suggestion with symbolic search achieves strong proof success rates on hard systems verification benchmarks.

6. Machines acquire scientific taste from institutional traces

Empirical study showing AI systems trained on institutional publication traces learn implicit quality signals ('scientific taste') that go beyond verifiable correctness.

Honorable Mentions

  • Formalization of QFT
  • Accelerating Structure-Property Relationship Discovery with Multimodal Machine Learning and Self-Driving Microscopy
  • Toward Reliable, Safe, and Secure LLMs for Scientific Applications
  • Polarization Dynamics in Ferroelectrics: Insights Enabled by Machine Learning Molecular Dynamics