A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
Pengfei Liu, Jun Tao, Zhixiang Ren
TL;DR
Chemical reaction predictions (CRPs) must cope with a vast, uncertain reaction space, and existing methods make limited use of the reaction knowledge inherent in the data. The authors present a data-curated, self-feedback knowledge elicitation framework that uncovers reaction-type (RT) knowledge and injects it into large language models (LLMs) via adaptive prompts. It raises retrosynthesis prediction accuracy by 14.2% and reagent prediction accuracy by 74.2%, and it extends the model's multi-task CRP capabilities. The approach combines RT annotation through self-feedback clustering, data curation from Mol-Instructions, and a dynamic prompt system with a template library that produces a prompt_enhanced input for LLM-CRP. The work demonstrates the untapped potential of LLMs for scientific knowledge elicitation and proposes a practical paradigm for integrating domain priors into scientific language modeling.
Abstract
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and materials science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and by the difficulty of capturing reaction selectivity, largely because existing methods make limited use of the knowledge inherent in the data. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. The method starts from iterative optimization of molecular representations, which facilitates the extraction of knowledge about chemical reaction types (RTs). We then employ adaptive prompt learning to infuse this prior knowledge into the large language model (LLM). As a result, we achieve significant improvements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% increase in reagent prediction accuracy, and an expanded capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
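As a rough illustration of the adaptive prompting idea, the sketch below prepends a reaction-type hint to a task instruction before it is passed to an LLM. It is a minimal sketch under stated assumptions, not the authors' implementation: the template wording, the task names, the integer RT labels, and the names `TEMPLATE_LIBRARY`, `DEFAULT_TEMPLATE`, and `build_enhanced_prompt` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical template library keyed by task name.
# The wording of these templates is an assumption, not taken from the paper.
TEMPLATE_LIBRARY: Dict[str, str] = {
    "retrosynthesis": "The product is likely formed by a type-{rt} reaction. ",
    "reagent_prediction": "The reaction is likely of type {rt}. ",
}

# Fallback used when a task has no dedicated template (also an assumption).
DEFAULT_TEMPLATE = "Reaction type: {rt}. "


def build_enhanced_prompt(task: str, instruction: str,
                          rt_label: Optional[int]) -> str:
    """Prepend reaction-type (RT) knowledge to an instruction.

    If no RT label is available, the instruction is returned unchanged.
    """
    if rt_label is None:
        return instruction
    template = TEMPLATE_LIBRARY.get(task, DEFAULT_TEMPLATE)
    return template.format(rt=rt_label) + instruction


if __name__ == "__main__":
    # The RT label 3 stands in for a cluster id produced by some
    # upstream annotation step; it is purely illustrative.
    print(build_enhanced_prompt(
        "retrosynthesis", "Propose reactants for the product CCO.", 3))
```

Keeping the templates in a dictionary separates the RT annotation step from the prompt wording, so templates can be added per task without touching the lookup logic.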
