Bayesian Optimization of Catalysis With In-Context Learning

Mayk Caldas Ramos, Shane S. Michtavy, Marc D. Porosoff, Andrew D. White

TL;DR

The paper introduces BO-ICL, a framework that treats language-based experimental procedures as the design space and uses frozen large language models with in-context learning as surrogates in Bayesian optimization. By avoiding feature engineering or model retraining, it demonstrates efficient, zero-shot to few-shot catalyst optimization across ESOL solubility, OCM, alloy-interface, and RWGS datasets, including a real-world RWGS experiment that nears thermodynamic yield within six iterations. Key findings show that LLM-based surrogates can provide uncertainty estimates, achieve competitive or superior BO performance to traditional baselines, and operate directly in language space, enabling rapid, interpretable material design with open-source tooling. The work also discusses calibration, hallucination, data-leakage, and exploration-exploitation considerations, offering practical guidance for deploying language-driven BO in catalysis and broader materials science. Overall, BO-ICL redefines materials representation and accelerates discovery using natural language as a universal interface for optimization.

Abstract

Large language models (LLMs) can perform accurate classification with zero or few examples through in-context learning. We extend this capability to regression with uncertainty estimation using frozen LLMs (e.g., GPT-3.5, Gemini), enabling Bayesian optimization (BO) in natural language without explicit model training or feature engineering. We apply this to materials discovery by representing experimental catalyst synthesis and testing procedures as natural language prompts. A key challenge in materials discovery is the need to characterize suboptimal candidates, which slows progress. While BO is effective for navigating large design spaces, standard surrogate models like Gaussian processes assume smoothness and continuity, an assumption that fails in highly non-linear domains such as heterogeneous catalysis. Our task-agnostic BO workflow overcomes this by operating directly in language space, producing interpretable and actionable predictions without requiring structural or electronic descriptors. On benchmarks like aqueous solubility and oxidative coupling of methane (OCM), BO-ICL matches or outperforms Gaussian processes. In live experiments on the reverse water-gas shift (RWGS) reaction, BO-ICL identifies near-optimal multi-metallic catalysts within six iterations from a pool of 3,700 candidates. Our method redefines materials representation and accelerates discovery, with broad applications across catalysis, materials science, and AI. Code: https://github.com/ur-whitelab/BO-ICL.
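The loop the abstract describes, in which a surrogate returns a prediction with an uncertainty for each text procedure and an acquisition rule picks the next experiment from a finite pool, can be sketched as follows. This is a minimal illustration under stated assumptions, not the BO-ICL implementation: `toy_surrogate` is a hypothetical token-overlap stand-in for the frozen LLM, the upper-confidence-bound rule is one common acquisition choice (the paper may use another), and `bo_loop`, `oracle`, and `beta` are names introduced here.

```python
import random
from typing import Callable, Dict, List, Tuple

# A surrogate maps (candidate text, observed {text: value}) to (mean, std).
Surrogate = Callable[[str, Dict[str, float]], Tuple[float, float]]


def toy_surrogate(candidate: str, observed: Dict[str, float]) -> Tuple[float, float]:
    """Hypothetical stand-in for a frozen LLM queried with in-context examples.

    Mean: similarity-weighted average of observed values (Jaccard token overlap).
    Std:  1 - (highest similarity), so unfamiliar text gets more uncertainty.
    """
    cand = set(candidate.lower().split())
    sims = []
    for text in observed:
        toks = set(text.lower().split())
        sims.append(len(cand & toks) / len(cand | toks))
    values = list(observed.values())
    total = sum(sims)
    if total == 0:
        mean = sum(values) / len(values)
    else:
        mean = sum(s * v for s, v in zip(sims, values)) / total
    return mean, 1.0 - max(sims)


def upper_confidence_bound(mean: float, std: float, beta: float = 2.0) -> float:
    """Acquisition score: favor high predictions and high uncertainty."""
    return mean + beta * std


def bo_loop(
    pool: List[str],
    oracle: Callable[[str], float],
    surrogate: Surrogate,
    n_init: int = 2,
    n_iter: int = 6,
    beta: float = 2.0,
    seed: int = 0,
) -> Dict[str, float]:
    """Pool-based Bayesian optimization over text procedures."""
    rng = random.Random(seed)
    # Seed the surrogate's context with a few randomly chosen measurements.
    observed = {p: oracle(p) for p in rng.sample(pool, n_init)}
    for _ in range(n_iter):
        remaining = [p for p in pool if p not in observed]
        if not remaining:
            break
        # Score every unmeasured candidate and "run" the best-scoring one.
        best = max(
            remaining,
            key=lambda p: upper_confidence_bound(*surrogate(p, observed), beta),
        )
        observed[best] = oracle(best)
    return observed
```

In a real setting, `oracle` would be a laboratory measurement and `surrogate` a call to an LLM that returns a prediction with an uncertainty estimate; only those two callables would need to change.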

Paper Structure (30 sections, 24 equations, 22 figures, 4 tables, 5 algorithms)


Figures (22)

  • Figure 1: A high-level overview of a closed-loop Bayesian optimization (BO) method that uses natural language to represent a material design space for efficient sample space exploration. The workflow involves conversion of tabular data into an experimental procedure, which incorporates both synthesis and reaction parameters. By formatting material parameters for compatibility with state-of-the-art large language models, this approach leverages well-established BO techniques to efficiently identify actionable experimental conditions that maximize a desired objective function. In this figure, we highlight a success case for optimizing catalysts for selective $\mathrm{CO_2}$ conversion to $\mathrm{CO}$ via BO-ICL.
  • Figure 2: Performance comparison of baseline models versus BO-ICL based on the number of points in the model's memory or used to train, as applicable. The top row shows the Mean Absolute Error (MAE) as a function of the number of training samples (N), while the bottom row shows Pearson correlation (r). The models compared include GPT-3.5-turbo-0125, GPT-4o, Kernel Ridge Regression (KRR), k-Nearest Neighbors (KNN), and a Gaussian Process Regressor. The shaded areas represent the range of the predictions in each replicate.
  • Figure 3: Parity plots for the regression task on the OCM dataset across different models. Each model was evaluated over five independent replicates, with each plot aggregating all predicted vs. true values. Reported metrics reflect the mean and standard deviation across replicates. Large language models (LLMs) exhibit comparable performance, with GPT-4o showing a slight edge. Interestingly, kernel ridge regression (KRR) achieves the highest correlation among all models, though it was not further explored due to its lack of uncertainty estimates.
  • Figure 4: Bayesian optimization results for the OCM dataset. All results use an embedded natural language representation of the sampled experimental procedures as the input feature representation. Convergence rates improve when using gpt-4-0125-preview instead of GPT-3.5-turbo (data not shown). While Gemini-2.5-flash requires, on average, 15 iterations to reach the $99^{\mathrm{th}}$ percentile of the OCM dataset distribution, both GPT-4o and GPR reach it after only 10 new samples, on average. This figure also suggests that GPR using LLM embeddings performs satisfactorily (for GPR specifics, see Section \ref{ssec:gpr}).
  • Figure 5:
  • ...and 17 more figures
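
The conversion described in the Figure 1 caption, from a tabular row to an experimental-procedure text that a language model can read, might look like the sketch below. Everything specific here is an assumption made for illustration: the column names (`metal`, `loading_wt_pct`, `support`, `temperature_C`), the sentence template, and the few-shot prompt layout are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Tuple


def row_to_procedure(row: Dict[str, object]) -> str:
    """Render one tabular record as a short procedure (hypothetical template)."""
    return (
        f"Synthesize a catalyst with {row['loading_wt_pct']} wt% {row['metal']} "
        f"on {row['support']}. Run the reaction at {row['temperature_C']} C."
    )


def build_prompt(
    examples: List[Tuple[str, float]],
    query: str,
    target: str = "CO yield (%)",
) -> str:
    """Assemble a few-shot prompt: labeled procedures, then the unlabeled query."""
    blocks = [
        f"Procedure: {procedure}\n{target}: {value:.1f}"
        for procedure, value in examples
    ]
    # The final block leaves the target blank for the model to complete.
    blocks.append(f"Procedure: {query}\n{target}:")
    return "\n\n".join(blocks)
```

With an empty `examples` list the same function yields a zero-shot prompt, and each new measurement appended to `examples` extends the in-context set without any retraining.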