Escaping the Hydrolysis Trap: An Agentic Workflow for Inverse Design of Durable Photocatalytic Covalent Organic Frameworks

Iman Peivaste, Nicolas D. Boscher, Ahmed Makradi, Salim Belouettar

Abstract

Covalent organic frameworks (COFs) are promising photocatalysts for solar hydrogen production, yet the most electronically favorable linkages, imines, hydrolyze rapidly in water, creating a stability--activity trade-off that limits practical deployment. Navigating the combinatorial design space of nodes, linkers, linkages, and functional groups to identify candidates that are simultaneously active and durable remains a formidable challenge. Here we introduce Ara, a large-language-model (LLM) agent that leverages pretrained chemical knowledge (donor--acceptor theory, conjugation effects, and linkage stability hierarchies) to guide the search for photocatalytic COFs satisfying joint band-gap, band-edge, and hydrolytic-stability criteria. Evaluated against random search and Bayesian optimization (BO) over a combinatorial space of candidates built from different nodes, linkers, linkages, and R-groups, and screened with a GFN1-xTB fragment pipeline, Ara achieves a 52.7\% hit rate (11.5$\times$ random, p = 0.006), finds its first hit at iteration 12 versus 25 for random search, and significantly outperforms BO (p = 0.006). Inspection of the agent's reasoning traces reveals interpretable chemical logic: early convergence on vinylene and $\beta$-ketoenamine linkages for stability, node selection informed by electron-withdrawing character, and systematic R-group optimization to center the band gap at 2.0 eV. Exhaustive evaluation of the full search space uncovers a complementary exploitation--exploration trade-off between the agent and BO, suggesting that hybrid strategies may combine the strengths of both approaches. These results demonstrate that LLM chemical priors can substantially accelerate multi-criteria materials discovery.

Paper Structure (5 sections, 5 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of the Ara agentic workflow for COF photocatalyst discovery. (a) The combinatorial design space comprises 820 candidates formed from 7 trigonal nodes, 19 ditopic linkers, 4 linkage chemistries of varying hydrolytic stability, and 10 aromatic R-group substituents, subject to chemical compatibility constraints. (b) Each candidate is evaluated through a fragment-based screening pipeline: a node–linker–node repeat unit is assembled via RDKit, embedded in 3D, optimized with GFN1-xTB, and scored for band gap (IP$-$EA, mapped to the DFT scale via a calibrated transfer function), conduction-band minimum (CBM), and composite stability index ($S_{\mathrm{CSI}}$). A candidate is classified as a hit if it satisfies all three criteria simultaneously. (c) Three search strategies are compared: random sampling, Bayesian optimization with a Gaussian process surrogate on Morgan fingerprints, and the Ara LLM agent, which iteratively proposes candidates with explicit chemical reasoning, receives quantitative feedback, and refines its selections over 200 iterations.
  • Figure 2: Scatter plot of xTB (IP$-$EA) fundamental gap versus DFT band gap for 13 COFs spanning six linkage types, with the linear transfer function overlaid. The calibration set includes boronate ester, boroxine, and triazine linkage types not present in the search space to broaden the range of the transfer function. Note that CTF-1 employs triazine ring formation as the linkage chemistry, distinct from triazine-containing nodes (e.g., TFPT-ald, TAPT) that connect via imine or other bond types.
  • Figure 3: Cumulative hits versus iteration for random search (grey), Bayesian optimization (blue), and Ara (agent, pink). Shaded regions indicate mean $\pm$ s.d. across five random seeds (200 iterations per run). A hit is defined as a candidate satisfying 1.8--2.2 eV band gap, CBM $< 0$ V, and $S_{\mathrm{CSI}} \geq 0.7$. The agent outperforms both random search and BO in cumulative hits and discovers its first hit sooner.
  • Figure 4: Linkage-type distribution over search iterations. Fraction of candidate selections belonging to each linkage type (imine, $\beta$-ketoenamine, hydrazone, vinylene) in a rolling window of 20 iterations, averaged over five seeds. Left: random search maintains a roughly uniform distribution across linkage types throughout the run. Right: Ara (agent) shifts from a mixed distribution early in the search to a strong preference for vinylene and, to a lesser extent, $\beta$-ketoenamine by iteration 50, reflecting chemistry-aware prioritization of hydrolytically stable linkages (high $S_{\mathrm{CSI}}$) while exploring node and R-group choices to meet the band-gap and band-edge criteria.
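
The hit criterion and the two curve types described in the captions of Figures 3 and 4 can be illustrated with a minimal Python sketch. This is not the authors' code. The `Candidate` class, the function names, and the example values are hypothetical. The thresholds (1.8 to 2.2 eV band gap, CBM below 0 V, stability index of at least 0.7) are taken from the Figure 3 caption. Treating the band-gap bounds as inclusive and using a trailing window for the rolling fraction are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one screened COF candidate."""
    linkage: str        # e.g. "imine", "vinylene"
    band_gap_ev: float  # band gap on the DFT scale, in eV
    cbm_v: float        # conduction-band minimum, in V
    s_csi: float        # composite stability index


def is_hit(c: Candidate) -> bool:
    # All three criteria must hold at once (inclusive gap bounds assumed).
    return 1.8 <= c.band_gap_ev <= 2.2 and c.cbm_v < 0.0 and c.s_csi >= 0.7


def cumulative_hits(trace: list[Candidate]) -> list[int]:
    # Running count of hits after each iteration of a single search run.
    return list(accumulate(int(is_hit(c)) for c in trace))


def rolling_linkage_fractions(
    trace: list[Candidate], window: int = 20
) -> list[dict[str, float]]:
    # Fraction of selections per linkage type in a trailing window
    # (window alignment is an assumption, not stated in the caption).
    fractions = []
    for end in range(window, len(trace) + 1):
        counts = Counter(c.linkage for c in trace[end - window:end])
        fractions.append({k: v / window for k, v in counts.items()})
    return fractions


# Made-up values, for illustration only.
trace = [
    Candidate("imine", 2.0, -0.5, 0.3),             # fails stability
    Candidate("vinylene", 2.0, -0.5, 0.9),          # hit
    Candidate("vinylene", 2.5, -0.5, 0.9),          # fails band gap
    Candidate("beta-ketoenamine", 1.9, -0.2, 0.8),  # hit
]
print(cumulative_hits(trace))  # [0, 1, 1, 2]
print(rolling_linkage_fractions(trace, window=2))
```

Averaging these per-run curves over seeds would give the mean and spread that the captions report. The sketch handles a single run only.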