A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions

Pengfei Liu, Jun Tao, Zhixiang Ren

TL;DR

Chemical reaction predictions face vast, uncertain reaction spaces and limited exploitation of intrinsic reaction knowledge. The authors present a data-curated, self-feedback knowledge elicitation framework to uncover reaction-type knowledge and inject it into large language models via adaptive prompts, achieving notable gains in retrosynthesis (+14.2%) and reagent prediction (+74.2%), along with improved multi-task CRP capabilities. The approach combines RT annotation through self-feedback clustering, data curation from Mol-Instructions, and a dynamic prompt system with a template library to produce a prompt_enhanced input for LLM-CRP. This work demonstrates the untapped potential of LLMs for scientific knowledge elicitation and proposes a practical paradigm for integrating domain priors into scientific language modeling.

Abstract

The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
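
The adaptive prompt step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The template strings, the bag-of-words `embed` function (a stand-in for LLM embeddings), and the names `select_template` and `build_prompt` are all hypothetical and chosen only for illustration. The sketch scores each template in a small library against the task input by cosine similarity, picks the best match, and prepends the annotated RT label as prior knowledge.

```python
import math
import re
from collections import Counter

# Hypothetical template library: illustrative strings, not the paper's templates.
TEMPLATES = [
    "Given the product, propose the reactants needed to synthesize it.",
    "Given the reactants and product, suggest suitable reagents.",
    "Given the reactants and reagents, predict the product.",
]

def embed(text: str) -> Counter:
    """Toy embedding: bag of lowercase word counts (assumed stand-in for LLM embeddings)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_template(task_input: str, templates=TEMPLATES) -> str:
    """Pick the template whose embedding is most similar to the task input."""
    query = embed(task_input)
    return max(templates, key=lambda t: cosine(query, embed(t)))

def build_prompt(task_input: str, molecule: str, rt_label: int) -> str:
    """Assemble an enhanced prompt: selected instruction, RT prior, then the input."""
    template = select_template(task_input)
    return f"{template}\nReaction type (RT): {rt_label}\nInput: {molecule}"

if __name__ == "__main__":
    question = "Which reagents are suitable for these reactants and product?"
    print(build_prompt(question, "CCO.CC(=O)O>>CC(=O)OCC", rt_label=3))
```

In a real system the similarity would presumably be computed over learned embeddings rather than word counts; the selection logic is the only part this sketch is meant to convey.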

Paper Structure

This paper contains 18 sections, 13 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of tasks and approaches. (a) Chemical reaction prediction tasks: three tasks, each with an example. (b) Current LLM methods for CRPs, whose predictions appear rational but lack reactive validity. (c) Self-feedback knowledge elicitation for enhancing CRPs: knowledge patterns, notably RTs, are refined with a self-feedback knowledge elicitation technique. Knowledge elicitation serves as a method of data curation for knowledge distillation, where RT is integrated into large language models via adaptive prompt learning, facilitating the planning of reaction pathways in CRPs.
  • Figure 2: Three-stage training scheme of prompt-based knowledge elicitation. Knowledge extraction: the datasets are divided into training, validation, and test sets. The inputs and outputs of the training set are clustered using LLM-RT embeddings, yielding RT annotations. The annotation accuracy of LLM-RT is refined by iteratively tuning the cluster parameters and training on inputs paired with RTs, with the aim of improving precision and identifying the best clustering. Data curation: the trained LLM-RT annotates RTs for the validation and test sets based on their inputs. Adaptive knowledge injection: adaptability is calculated from the embeddings of inputs and instructions and used to select adaptive instructions. The LLM is then fine-tuned with prompts enhanced with this prior knowledge.
  • Figure 3: Performance of encoding vector self-feedback annotation and clustering. (a) Accuracy of RT annotations across encoding vectors and cluster numbers. We compare the annotation accuracy $Acc$ of four encoding methods over a range of reasonable cluster numbers $N$. The results indicate that encoding with $concat(input, output)_{vec}$ yields the best performance. (b) Clustering of the test dataset vectors $concat(input, output)_{vec}$ with $N=6$ and $N=10$. The vectors are reduced to two dimensions via a linear layer to display the clustering outcome.
  • Figure 4: Case studies of RT annotation. To validate the practical significance of RT annotation, we filter the results labeled using the $concat(input, output)_{vec}$ vector with $N=10$, focusing on samples with an RT label of 0. The molecules in these instances undergo simple atomic substitutions. This analysis verifies the predominance of substitution reactions within these cases, demonstrating the real-world relevance of our RT annotation method.
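
The self-feedback loop summarized in the captions for Figures 2 and 3 (cluster the $concat(input, output)_{vec}$ vectors, measure how well RT labels can be recovered from inputs alone, then adjust the cluster number $N$) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's method: a plain k-means with deterministic initialization replaces the authors' clustering, a nearest-centroid rule on the input part of each vector replaces the trained LLM-RT annotator, and the toy vectors and function names are hypothetical.

```python
def concat(input_vec, output_vec):
    """Concatenate input and output vectors, mirroring concat(input, output)_vec."""
    return list(input_vec) + list(output_vec)

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at k evenly spaced data points (deterministic)."""
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def annotation_accuracy(inputs, labels, centroids):
    """Assumed stand-in for LLM-RT: predict the RT label from the input part only."""
    d = len(inputs[0])
    hits = 0
    for x, label in zip(inputs, labels):
        pred = min(range(len(centroids)), key=lambda c: dist2(x, centroids[c][:d]))
        hits += pred == label
    return hits / len(inputs)

def select_cluster_number(inputs, outputs, candidates):
    """Feedback loop: try each N, keep the one with the best annotation accuracy."""
    points = [concat(i, o) for i, o in zip(inputs, outputs)]
    scores = {}
    for n in candidates:
        labels, centroids = kmeans(points, n)
        scores[n] = annotation_accuracy(inputs, labels, centroids)
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    # Toy vectors only; real inputs would be learned molecular embeddings.
    inputs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    outputs = [[0], [0], [0], [5], [5], [5]]
    print(select_cluster_number(inputs, outputs, candidates=[2, 3]))
```

The design point the sketch tries to capture is that cluster labels are derived from inputs and outputs together, while the annotator sees inputs only, so its accuracy serves as the feedback signal for choosing $N$.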