Large language models as uncertainty-calibrated optimizers for experimental discovery

Bojana Ranković, Ryan-Rhys Griffiths, Philippe Schwaller

TL;DR

GOLLuM addresses the core challenge of uncertainty-aware experimental optimization by unifying large language models with Bayesian optimization through a deep kernel Gaussian process. By training LLM embeddings inside the GP objective, uncertainty guides the adaptation of representations, transforming LLMs from brittle, overconfident predictors into calibrated optimizers that can operate from natural-language descriptions. Across Buchwald–Hartwig reactions and 19 diverse domains, the method nearly doubles high-yield discovery rates and ranks first on average, demonstrating robust cross-domain generalization and interpretable latent organization that aligns with chemical patterns. This framework lowers barriers to AI-guided experimentation by combining the accessibility of language interfaces with principled uncertainty, suggesting a general paradigm for reliable AI-driven discovery in science.
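
To make "training LLM embeddings inside the GP objective" concrete, the standard deep kernel learning objective can be written as below. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes a zero-mean GP with Gaussian observation noise, and the notation is introduced here for illustration, with $g_\phi$ denoting the LLM embedding map with weights $\phi$, $k_\theta$ a base kernel with hyperparameters $\theta$, $\sigma^2$ the noise variance, and $n$ the number of observed experiments.

```latex
\begin{align}
  k_{\theta,\phi}(x_i, x_j) &= k_\theta\big(g_\phi(x_i),\, g_\phi(x_j)\big), \\
  \log p(\mathbf{y} \mid X, \theta, \phi)
    &= -\tfrac{1}{2}\,\mathbf{y}^\top \big(K_{\theta,\phi} + \sigma^2 I\big)^{-1} \mathbf{y}
       - \tfrac{1}{2}\log\big|K_{\theta,\phi} + \sigma^2 I\big|
       - \tfrac{n}{2}\log 2\pi .
\end{align}
```

Maximizing this marginal likelihood jointly over $\theta$ and $\phi$ is what lets gradients of a probabilistic objective flow back into the language model: the data-fit term rewards embeddings under which similar outcomes have high kernel similarity, while the log-determinant term penalizes overly flexible fits.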

Abstract

Scientific discovery increasingly depends on efficient experimental optimization to navigate vast design spaces under time and resource constraints. Traditional approaches often require extensive domain expertise and feature engineering. While large language models, with their vast scientific knowledge, circumvent the feature engineering limitations, they lack the calibrated uncertainty estimates required for high-stakes decision making. Hence, current optimization methods force a choice between domain knowledge and reliability, with no principled approach that affords both. In this work, we show that training language models through the uncertainty-aware objectives of traditional optimization methods enables their use as reliable optimizers guided by natural language. By teaching LLMs from experimental outcomes under uncertainty, we transform their overconfidence from a fundamental limitation into a precise calibration mechanism. Applied to Buchwald-Hartwig reactions, a cornerstone of pharmaceutical synthesis, our method nearly doubles the discovery rate of high-yielding reaction conditions, from 24% to 43% in 50 experimental iterations starting from 10 unsuccessful conditions. Across 19 diverse optimization problems spanning organic synthesis, materials science and catalysis, process chemistry, and molecular design, our approach ranks first on average, establishing a new paradigm for reliable, uncertainty-guided optimization with LLMs. Our approach can accelerate discovery by lowering the barrier to using powerful optimization methods, replacing the need for domain-specific feature engineering with more accessible natural language interfaces. These findings highlight that ensuring reliability through principled uncertainty quantification is critical for realizing the full potential of AI-guided experimentation.

Paper Structure

This paper contains 64 sections, 13 equations, 15 figures, 4 tables, 2 algorithms.

Figures (15)

  • Figure 1: Language models as uncertainty-calibrated optimizers. a, Experimental optimization challenges. Scientists face the universal problem of efficiently exploring vast design spaces under time and cost constraints. b, Current approaches impose a trade-off between accessibility and reliability. Bayesian optimization provides principled uncertainty calibration and sample efficiency but requires domain expertise. LLMs offer natural language interfaces and prior knowledge but lack reliability. c, GOLLuM optimization loop. Previous experimental results, described in natural language, feed into joint optimization of both GP and LLM components. Likelihood gradients from the GP finetune the LLM, progressively smoothing the optimization landscape and organizing the search space into distinct regions of high and low performance: "the good", "the bad" (and the ugly). This structured representation enables easier selection of the next experiment, which becomes a new training point that continues the loop. d, Key results. GOLLuM ranks first averaged across 19 cross-domain tasks spanning materials, reactions, molecules, and processes, enabling sample-efficient discovery from natural language inputs while surpassing specialized descriptors, domain-pretrained models, and proprietary LLMs.
  • Figure 2: From LLM prompting to LLMs as principled scientific optimizers. a, Querying LLMs for optimization. Given the parameter space and previous experiments, the LLM proposes next suggestions. This method relies on heuristic search and suffers from a lack of uncertainty calibration, which leads to invalid suggestions, hallucinations, and frequent failures. b, The GOLLuM framework integrates an LLM and a GP in a joint optimization loop. Gradients from the GP's probabilistic objective finetune the LLM, creating an interpretable search space that enables principled, uncertainty-aware, and sample-efficient optimization. c, Quantitative results on the Buchwald-Hartwig reactions benchmark. (Left) The combinatorial design space of 3955 reactions across 5 distinct chemical products, corresponding to $\sim$800 unique conditions per product. (Middle) The analysis of response types from several proprietary LLMs reveals failure rates, with a high fraction of invalid or duplicated suggestions. We executed 5 independent optimization seeds (API-cost limited) per product. Each run cold-started with 10 initial low-performing reactions and repeatedly queried the model until reaching a quota of 50 unique, in-space suggestions. The stacked bars aggregate the raw outputs from all queries issued while reaching that quota across 25 runs (5 products × 5 seeds) per model ($>$1250 queries). Categories: Valid, Duplicate (already proposed in the same run), Invalid (outside the parameter space), and LLM failure (format/parse/timeout). (Right) A comparison of optimization performance ("Top 5% Coverage") shows that direct LLM approaches substantially underperform traditional Bayesian optimization (with reaction fingerprints [probst2022reaction]) and our proposed GOLLuM method using GP-finetuned general-purpose open-source LLMs (T5 [2020t5]). Bars show mean and standard error. The performance comparison excludes Claude-3.5 Haiku (did not reach the quota of 50 valid in-space suggestions).
  • Figure 3: Representation geometry enables efficient Bayesian optimization. a, BO performance with fixed LLM features as input to GP. We benchmark three encoder models (ModernBERT [modernbert], UAE [li2023angle], MXBAI [emb2024mxbai]), three encoder-decoders (Instructor [su2022one], T5 [2020t5], and its chemistry-related variant T5Chem [christofidellis2023unifying]), and four decoder families (Llama [grattafiori2024llama, behnamghader2024llm2vec], Mistral [behnamghader2024llm2vec], Qwen [bai2023qwen, li2023towards], and OpenAI [openai2024new]). We show average discovery of high-impact regions of the design space as the percentage of the top 5% reactions found during the optimization, aggregated across all five Buchwald-Hartwig reactions. Chemistry-related representations include T5Chem-SMILES [christofidellis2023unifying], a pretrained chemistry-related LLM with SMILES input, and DRFP [probst2022reaction], a reaction fingerprint. b, Comparative analysis of GP-based LLM finetuning. The finetuned models are arranged by overall performance and by relative improvement over their base (fixed embeddings) LLM-GP variants. Chemistry-related baselines (previous best) are included for comparison. All results show mean and standard error across 20 independent seeds, with each optimization run starting from 10 initial points (random below median) and selecting subsequent points via acquisition function maximization over the enumerated design space. c, Data representations and their success rates in BO. BO performance correlates with GP smoothness, measured as the ratio of learned lengthscale to average pairwise embedding distance. Higher ratios indicate smoother GP fits, implying the representation supports meaningful generalization across the design space.
  • Figure 4: Implicit contrastive learning and emergent chemical interpretability with LLM-based deep kernel GPs. a, The raw design space of one of the Buchwald-Hartwig reactions, showing reaction yields across different combinations of additives, bases, and ligands for three different aryl halides (I, Br, Cl). b, A schematic of LLM-based deep kernel metric learning. The joint optimization of the LLM and GP encourages the model to learn embeddings in which experiments with similar outcomes lie close together in the latent space, resulting in high kernel similarity and low pairwise distance. c, The evolution of the LLM's latent space during the optimization process. Starting from an unstructured state (Iteration 0), the space becomes progressively more organized (Iteration 25), eventually forming a chemically meaningful map where reactions cluster based on key reactivity patterns and performance (Iteration 50). d, The link between implicit contrastive learning and optimization performance. The main plot shows that the adaptive model (PLLM$\phi$+T5) significantly outperforms the static model (T5). The inset histograms show the distribution of pairwise L2 distances between high-yielding and low-yielding experiments at different iterations. As the optimization progresses, the model learns to separate high- and low-performing points, creating a more structured space that enables more efficient discovery.
  • Figure 5: Benchmarking on various chemistry-related optimization tasks with comparisons to related approaches. a, On average across all 19 benchmarks, our GOLLuM framework (PLLM$\phi$+T5 or T5Chem) outperforms traditional domain-specific features (GP), static LLM embeddings (Bochemian [rankovic2023bochemian]), and decoupled uncertainty calibration methods (LAPEFT [kristiadi2024sober]). Bars show mean and standard error across 20 seeds per method per benchmark (10 below-median initial points, sequential acquisition-based optimization). b, The framework demonstrates robustness to variations in input format (textual procedure vs. SMILES) and eliminates the need for chemistry-specific pretraining, as the generalist T5 [2020t5] model performs as well as T5Chem [christofidellis2023unifying]. c, Detailed results for all 19 optimization tasks, grouped by scientific domains. Synthetic organic chemistry: Buchwald-Hartwig [ahneman2018predicting] (BH1-5), Suzuki-Miyaura [perera2018platform], and Ni-catalyzed arylation [prieto2022accelerating] reactions (Additives1-4). Analytic and process chemistry with high-performance liquid chromatography (HPLC setup) [hase_olympus_2021]. Materials science and catalysis: oxygen evolution reaction catalysts (OER) [hase_olympus_2021], C2 yield optimization [ramos2023bayesian], vapor diffusion crystallization (Vapdiff) [hase_olympus_2021]. Molecular optimization: minimizing redox potential (Redox) [Agarwal2021] and solvation energy (Solvation) [Agarwal2021], inhibiting kinase activity (Kinase) [Graff2021], and maximizing photoswitch absorption wavelengths (Photoswitch) [Griffiths2022a] and power conversion efficiency (PCE) [Lopez2016]. Our method consistently achieves top-tier performance, demonstrating broad applicability from organic synthesis and materials science to molecular property and process optimization. We provide example procedural templates for tasks where SMILES is not applicable.
  • ...and 10 more figures
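
The evaluation quantities named in the captions above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. It assumes that "Top 5% Coverage" is the percentage of the top 5% of the enumerated design space (ranked by outcome) that appears among the selected experiments, that initialization samples uniformly from below-median outcomes, and that the smoothness ratio divides a single scalar lengthscale by the mean pairwise Euclidean distance between embeddings. All function names are hypothetical.

```python
import numpy as np


def below_median_init(yields, n_init, rng):
    """Sample n_init distinct indices among experiments with below-median outcome."""
    candidates = np.flatnonzero(yields < np.median(yields))
    return rng.choice(candidates, size=n_init, replace=False)


def top_fraction_coverage(yields, selected, fraction=0.05):
    """Percentage of the top `fraction` of the design space found in `selected`."""
    n_top = max(1, int(round(fraction * len(yields))))
    top_idx = set(np.argsort(yields)[-n_top:].tolist())
    found = top_idx.intersection(int(i) for i in selected)
    return 100.0 * len(found) / n_top


def smoothness_ratio(embeddings, lengthscale):
    """Scalar lengthscale divided by the mean pairwise Euclidean distance."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(embeddings), k=1)  # each unordered pair once
    return lengthscale / dists[upper].mean()
```

In a benchmark loop, `below_median_init` would provide the 10 starting points, each acquisition step would append one index to `selected`, and `top_fraction_coverage` would be reported after 50 iterations. A higher `smoothness_ratio` means the kernel varies slowly relative to typical distances between points, which is the sense in which the Figure 3 caption describes a smoother GP fit.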