ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System

Dong Han, Zhehong Ai, Pengxiang Cai, Shanya Lu, Jianpeng Chen, Zihao Ye, Shuzhou Sun, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao Xu, Yuqiang Li, Shufei Zhang

TL;DR

ChemBOMAS tackles the data-scarce, high-dimensional nature of chemical optimization by coupling an LLM-driven knowledge module for search-space decomposition with a data-driven LLM regressor that generates informative pseudo-data. This two-branch framework jointly warm-starts and guides Bayesian optimization within strategically identified subspaces, using a Gaussian Process surrogate with a Matérn kernel and acquisition functions, augmented by a Retrieval-Augmented Generation pipeline that partitions the space. Across four chemical benchmarks, ChemBOMAS achieves state-of-the-art performance, accelerating convergence by up to 5× and, in a wet-lab validation, reaching a 96% yield under tight experimental constraints where a human expert achieved only 15%. The results show robust synergy between knowledge-guided space partitioning and data-driven priors, and the approach also generalizes to a materials-science benchmark, highlighting substantial practical impact for accelerated chemical discovery.

Abstract

Bayesian optimization (BO) is a powerful tool for scientific discovery in chemistry, yet its efficiency is often hampered by sparse experimental data and vast search spaces. Here, we introduce ChemBOMAS, a large language model (LLM)-enhanced multi-agent system that accelerates BO through synergistic data- and knowledge-driven strategies. First, the data-driven strategy uses an 8B-scale LLM regressor, fine-tuned on a mere 1% of labeled samples, to generate pseudo-data that robustly initializes the optimization process. Second, the knowledge-driven strategy employs a hybrid Retrieval-Augmented Generation approach to guide the LLM in dividing the search space while mitigating LLM hallucinations. An Upper Confidence Bound algorithm then identifies high-potential subspaces within this established partition. Operating within the LLM-refined subspaces and supported by LLM-generated data, BO becomes both more effective and more efficient. Comprehensive evaluations across multiple scientific benchmarks demonstrate that ChemBOMAS sets a new state of the art, accelerating optimization by up to 5-fold compared to baseline methods.
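
The subspace-selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that an Upper Confidence Bound algorithm is used; the UCB1-style score below, the exploration constant `c`, the subspace names, and all numbers are assumptions made for illustration, not the authors' implementation. Each subspace's statistics are seeded as if from pseudo-data, so every count starts above zero.

```python
import math

def ucb_score(mean, count, total, c=1.0):
    """UCB1-style score (assumed form): empirical mean plus an exploration bonus."""
    return mean + c * math.sqrt(2.0 * math.log(total) / count)

def select_subspace(stats, c=1.0):
    """Return the name of the subspace with the highest UCB score.

    stats maps a subspace name to (sum_of_yields, n_observations);
    every subspace must have n_observations >= 1.
    """
    total = sum(n for _, n in stats.values())
    return max(
        stats,
        key=lambda name: ucb_score(stats[name][0] / stats[name][1],
                                   stats[name][1], total, c),
    )

def update(stats, name, observed_yield):
    """Fold one new observation (pseudo or real) into a subspace's statistics."""
    total_yield, n = stats[name]
    stats[name] = (total_yield + observed_yield, n + 1)

# Hypothetical subspaces, each seeded with two pseudo-data yields in [0, 1].
stats = {"ligand_A": (1.2, 2), "ligand_B": (0.6, 2), "ligand_C": (1.7, 2)}
first_pick = select_subspace(stats)   # equal counts, so the highest mean wins
update(stats, first_pick, 0.1)        # a disappointing real measurement
second_pick = select_subspace(stats)  # the score is recomputed with the new data
```

With equal counts the exploration bonus is identical for every subspace, so the first pick is simply the one with the highest mean. As real measurements arrive, rarely visited subspaces gain a larger bonus, which is the usual exploration-versus-exploitation trade-off that UCB-type rules encode.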

Paper Structure

This paper contains 44 sections, 3 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: ChemBOMAS: A synergistic knowledge- and data-driven framework for efficient Bayesian Optimization. The framework operates as a closed-loop system: the knowledge-driven module decomposes the search space into subspaces using LLM-extracted chemical insights, followed by a UCB algorithm to select promising subspaces; the data-driven module generates pseudo-data to initialize both the subspace selection and the Bayesian Optimization process within the selected subspaces. The two modules interact iteratively, with real data from optimization feedback refining subsequent search directions.
  • Figure 2: Optimization performance comparison between ChemBOMAS and baseline methods on the four benchmark datasets: (a) Suzuki, (b) Arylation, (c) Buchwald$_\text{sub-1}$, and (d) Buchwald$_\text{sub-2}$. ChemBOMAS exhibits accelerated convergence and achieves superior final performance with lower variance across all tasks, demonstrating its enhanced efficiency and robustness.
  • Figure 3: KDE plots illustrating the yield distributions for the four benchmark datasets.
  • Figure 4: Heatmap of the best-found objective value over 40 iterations on the Suzuki dataset for three different tree-building strategies. Each colored block represents the highest value discovered up to that iteration, with the color scale progressing from blue (low) to red (high). The visual similarity in the optimization trajectories demonstrates that both the knowledge-driven (ChemBOMAS$_\text{k-d}$) and data-driven (ChemBOMAS$_\text{d-d}$) methods closely mirror the performance progression of the expert-guided approach.
  • Figure 5: Wet laboratory experiment result. Comparison of 'Best Value Found (%)' over 'Iteration Rounds', showing individual high and low-value observations. Lines indicate maximum values achieved via ChemBOMAS, human experts, and a target threshold.
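
Figures 4 and 5 both report the best value found up to each iteration rather than the raw per-iteration observations. A minimal sketch of that metric follows; the function name `best_so_far` and the sample yields are hypothetical and not taken from the paper.

```python
from itertools import accumulate

def best_so_far(yields):
    """Return the running maximum of observed yields, one entry per iteration."""
    return list(accumulate(yields, max))

# Hypothetical per-iteration yields (%), for illustration only.
observed = [12.0, 30.5, 28.0, 61.0, 55.0]
trajectory = best_so_far(observed)
```

Because the running maximum never decreases, curves of this kind are monotone even when individual experiments return low values, which is why the plots can show low-value observations alongside a non-decreasing best-value line.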