ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
Dong Han, Zhehong Ai, Pengxiang Cai, Shanya Lu, Jianpeng Chen, Zihao Ye, Shuzhou Sun, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao Xu, Yuqiang Li, Shufei Zhang
TL;DR
ChemBOMAS tackles the data-scarce, high-dimensional nature of chemical optimization by coupling an LLM-driven knowledge module for search-space decomposition with a data-driven LLM regressor that generates informative pseudo-data. This two-branch framework warm-starts Bayesian optimization and guides it within strategically identified subspaces. The optimizer uses a Gaussian Process surrogate with a Matérn kernel and acquisition functions, while a Retrieval-Augmented Generation pipeline partitions the search space. Across four chemical benchmarks, ChemBOMAS achieves state-of-the-art performance, accelerating convergence by up to 5×. In a wet-lab validation under tight experimental constraints, it reached a 96% yield where a human expert achieved only 15%. The results show a robust synergy between knowledge-guided space partitioning and data-driven priors, and the approach also generalizes to a materials-science benchmark, highlighting its practical value for accelerated chemical discovery.
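To make the surrogate-plus-acquisition step concrete, the following is a minimal sketch of a Gaussian Process posterior with a Matérn kernel, scored by an upper-confidence-bound acquisition. It is an illustration under stated assumptions, not the authors' implementation: the smoothness value (5/2), the length scale, the choice of UCB as the acquisition function, the function names, and the toy one-dimensional data are all hypothetical, and plain NumPy stands in for whatever BO library the paper uses.

```python
import numpy as np

def matern52(a, b, length_scale=1.0):
    """Matérn kernel (smoothness 5/2, an assumed value) between rows of a and b."""
    dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)) / length_scale
    s = np.sqrt(5.0) * dist
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_posterior(x_train, y_train, x_query, noise=1e-6, length_scale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = matern52(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_tq = matern52(x_train, x_query, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_tq.T @ alpha
    v = np.linalg.solve(chol, k_tq)
    var = 1.0 - (v ** 2).sum(axis=0)  # prior variance k(x, x) is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

def ucb_acquisition(mean, std, beta=2.0):
    """Upper confidence bound: favors high predicted value and high uncertainty."""
    return mean + beta * std

# Toy data (hypothetical): three observed "yields" on a 1-D condition axis.
x_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.array([0.1, 0.9, 0.2])
x_query = np.linspace(0.0, 1.0, 21)[:, None]

mean, std = gp_posterior(x_train, y_train, x_query, length_scale=0.3)
scores = ucb_acquisition(mean, std)
next_x = x_query[np.argmax(scores)]  # candidate to evaluate next
```

In a warm-started setting, the training arrays would presumably hold pseudo-data from the LLM regressor alongside real measurements; how the two are weighted is not specified in this summary, so the sketch treats all points alike.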
Abstract
Bayesian optimization (BO) is a powerful tool for scientific discovery in chemistry, yet its efficiency is often hampered by sparse experimental data and vast search spaces. Here, we introduce ChemBOMAS: a large language model (LLM)-enhanced multi-agent system that accelerates BO through synergistic data- and knowledge-driven strategies. First, the data-driven strategy uses an 8B-scale LLM regressor, fine-tuned on a mere 1% of labeled samples, to generate pseudo-data that robustly initializes the optimization process. Second, the knowledge-driven strategy employs a hybrid Retrieval-Augmented Generation approach to guide the LLM in dividing the search space while mitigating LLM hallucinations. An Upper Confidence Bound algorithm then identifies high-potential subspaces within this established partition. Operating within the LLM-refined subspaces and supported by LLM-generated data, BO becomes both more effective and more efficient. Comprehensive evaluations across multiple scientific benchmarks demonstrate that ChemBOMAS sets a new state of the art, accelerating optimization by up to 5-fold compared to baseline methods.
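The subspace-selection step can be pictured as a multi-armed bandit in which each subspace of the partition is an arm. The sketch below uses the classic UCB1 rule as a stand-in, since the abstract names only "an Upper Confidence Bound algorithm" without giving its exact form. The function names, the exploration weight `c`, and the simulated yields are hypothetical and serve only to illustrate the idea.

```python
import math
import random

def select_subspace(counts, means, total_pulls, c=1.0):
    """Pick the subspace with the highest UCB1 score; untried subspaces go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        m + c * math.sqrt(2.0 * math.log(total_pulls) / n)
        for m, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def run_bandit(reward_fns, rounds, seed=0):
    """Repeatedly select a subspace, observe a reward, and update its running mean."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    means = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        i = select_subspace(counts, means, t)
        reward = reward_fns[i](rng)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
    return counts, means

def make_subspace(mu, sigma=0.05):
    """Hypothetical subspace whose experiments return noisy yields in [0, 1]."""
    return lambda rng: min(1.0, max(0.0, rng.gauss(mu, sigma)))

# Three simulated subspaces with average yields of 0.2, 0.5, and 0.8.
subspaces = [make_subspace(0.2), make_subspace(0.5), make_subspace(0.8)]
counts, means = run_bandit(subspaces, rounds=300)
```

In the full system the reward for a subspace would come from BO evaluations inside it rather than from a fixed distribution; the fixed distributions here only keep the example self-contained.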
