Distilling and exploiting quantitative insights from Large Language Models for enhanced Bayesian optimization of chemical reactions

Roshan Patel, Saeed Moayedpour, Louis De Lescure, Lorenzo Kogler-Anele, Alan Cherney, Sven Jager, Yasser Jangjou

TL;DR

The paper tackles accelerating chemical reaction optimization by marrying Bayesian optimization with information distilled from large language models. It introduces a survey-based prompting scheme to extract a utility function $g(x)$ via LLM preferences and uses a binary, percentile-weighted acquisition to bias BO toward promising regions, without fine-tuning the LLM. Empirical results across six reaction datasets show modest correlations between $g(x)$ and yields and demonstrate improved BO efficiency and initial-seed quality in several datasets. This approach offers a scalable pathway to incorporate chemistry knowledge embedded in LLMs into principled optimization, potentially reducing experimental cost while maintaining generality across reaction types.

Abstract

Machine learning and Bayesian optimization (BO) algorithms can significantly accelerate the optimization of chemical reactions. Transfer learning can bolster the effectiveness of BO algorithms in low-data regimes by leveraging pre-existing chemical information or data outside the direct optimization task (i.e., source data). Large language models (LLMs) have demonstrated that chemical information present in foundation training data can give them utility for processing chemical data. Furthermore, they can be augmented with and help synthesize potentially multiple modalities of source chemical data germane to the optimization task. In this work, we examine how chemical information from LLMs can be elicited and used for transfer learning to accelerate the BO of reaction conditions to maximize yield. Specifically, we show that a survey-like prompting scheme and preference learning can be used to infer a utility function which models prior chemical information embedded in LLMs over a chemical parameter space; we find that the utility function shows modest correlation to true experimental measurements (yield) over the parameter space despite operating in a zero-shot setting. Furthermore, we show that the utility function can be leveraged to focus BO efforts in promising regions of the parameter space, improving the yield of the initial BO query and enhancing optimization in 4 of the 6 datasets studied. Overall, we view this work as a step towards bridging the gap between the chemistry knowledge embedded in LLMs and the capabilities of principled BO methods to accelerate reaction optimization.
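The abstract describes two steps: inferring a utility function from LLM preferences, then using it to focus BO on promising regions. The sketch below illustrates one way such a pipeline could look, not the authors' implementation. It assumes a Bradley-Terry preference model fitted by gradient ascent and a hard percentile gate multiplied onto expected improvement; the function names (`fit_utility`, `expected_improvement`, `gated_ei`), the hyperparameters, and the toy numbers are all hypothetical.

```python
import math
import numpy as np

def fit_utility(n_items, prefs, lr=0.1, steps=2000, l2=1e-2):
    """Fit Bradley-Terry utilities g from (winner, loser) index pairs.

    Assumption: preferences follow P(i preferred over j) = sigmoid(g_i - g_j).
    A small L2 penalty keeps the utilities finite.
    """
    g = np.zeros(n_items)
    winners = np.array([w for w, _ in prefs])
    losers = np.array([l for _, l in prefs])
    for _ in range(steps):
        # Probability the observed winner beats the loser under current g
        p = 1.0 / (1.0 + np.exp(-(g[winners] - g[losers])))
        grad = -l2 * g
        np.add.at(grad, winners, 1.0 - p)     # raise the winner's utility
        np.add.at(grad, losers, -(1.0 - p))   # lower the loser's utility
        g += lr * grad
    return g

def expected_improvement(mu, sigma, best):
    """Standard expected improvement for maximization under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def gated_ei(mu, sigma, best, g, percentile):
    """Zero out EI for candidates whose utility is below the given percentile.

    Assumption: a binary 0/1 weight; the paper's exact weighting may differ.
    """
    mask = g >= np.percentile(g, percentile)
    return expected_improvement(mu, sigma, best) * mask

# Toy example: four candidate conditions, preferences imply 0 > 1 > 2 > 3.
prefs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
g = fit_utility(4, prefs)

# Hypothetical surrogate posterior (mean, std) over the same four candidates.
mu = np.array([0.2, 0.5, 0.9, 0.95])
sigma = np.full(4, 0.1)
scores = gated_ei(mu, sigma, best=0.4, g=g, percentile=50)
choice = int(np.argmax(scores))  # index of the next experiment to run
```

In this toy setup the gate removes the two lowest-utility candidates even though the surrogate rates them highly, which shows how a prior of this kind can override the surrogate early in a campaign. A hard gate is only useful if the utility correlates with yield, so relaxing the percentile as experiments accumulate would be a natural design choice.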

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: A schematic outlining the major elements of the approach in this study.
  • Figure 1: Optimized $p(n)$ used for the LLM-EI acquisition function
  • Figure 2: Assessing the correlation between utility function outputs computed for experiments across all datasets and their true measured yield. The Pearson r correlations between utility values and yields are 0.55, 0.63, 0.67, 0.22, 0.49, and 0.48 for datasets BH 1, BH 2, BH 3, BH 4, BH 5, and DA respectively, with p-values $<$ 1e-10 across all datasets. The least squares regression line between utility values and yields is plotted in each panel as a dotted black line to guide the eye.
  • Figure 2: Example of LLM reasoning in answering survey
  • Figure 3: Comparing BO of reaction parameters using the expected improvement acquisition function versus the LLM-preference guided expected improvement acquisition function. Panel a plots the best measured yield observed as a function of the number of experiments performed for each dataset, for BO campaigns employing a given acquisition function. The plotted line represents the mean value seen at a given N. exp over the course of n = 50 randomly seeded campaigns. Panel b shows the mean (and standard error from n = 50 trials) number of experiments needed to observe the maximum yield for a given dataset and acquisition function. Panel c shows the average yield (and standard error from n = 50 trials) observed in the initial experiment selected during BO. Values have been normalized by the maximum observed yield for a given dataset.
  • ...and 2 more figures
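
The Figure 2 caption reports a Pearson r and a least squares regression line between utility values and yields. As a minimal sketch of how those two quantities can be computed with NumPy, the snippet below uses small synthetic arrays; the data and the helper name `pearson_and_fit` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def pearson_and_fit(utility, yields):
    """Return Pearson r plus slope and intercept of the least squares line."""
    utility = np.asarray(utility, dtype=float)
    yields = np.asarray(yields, dtype=float)
    r = np.corrcoef(utility, yields)[0, 1]               # Pearson correlation
    slope, intercept = np.polyfit(utility, yields, deg=1)  # degree-1 fit
    return r, slope, intercept

# Synthetic example values (not from any dataset in the paper).
utility = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
yields = np.array([10.0, 30.0, 20.0, 60.0, 50.0])
r, slope, intercept = pearson_and_fit(utility, yields)
```

A p-value for r, like the one quoted in the caption, would need an extra significance test (for example from a statistics library), which this sketch leaves out.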