Entropy-Reinforced Planning with Large Language Models for Drug Discovery

Xuefeng Liu, Chih-chan Tien, Peng Ding, Songhao Jiang, Rick L. Stevens

TL;DR

This work tackles de novo drug discovery with large language models, where plain decoding often yields invalid molecules or candidates that are suboptimal across multiple properties. It introduces ERP, Entropy-Reinforced Planning for Transformer Decoding, which embeds an entropy-based planning module into MCTS-guided Transformer decoding. ERP uses a novel selection rule, $\text{P}\mathcal{H}\text{-UCT}$, together with Top-$P$/Top-$K$ expansion to balance exploitation and exploration, and it scores candidates on multiple objectives through a multi-critic reward $R^{\text{sum}}_{\text{norm}}$. On the SARS-CoV-2 3CLPro and RTCB targets, ERP consistently outperforms the state-of-the-art PG-TD and other baselines. The improvements are robust across pretrained, biased, and RL-finetuned LLMs, and the same entropy-reinforced planning paradigm also performs better on code-generation benchmarks. The approach improves sample efficiency, controllable generation, and the discovery of high-reward regions of molecular space, with practical implications for accelerating multi-objective drug design and potentially benefiting generative tasks beyond chemistry. The $e$-step forward entropy and the multi-critic framework enable ERP to navigate uncertain regions of the search space more effectively than prior planning methods, making ERP a versatile tool for structured sequence generation in complex domains.
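The ingredients named in the summary can be illustrated with a small sketch. This is not the authors' implementation, and the exact form of $\text{P}\mathcal{H}\text{-UCT}$ is not given on this page. The function names, the weights `c` and `w`, and the additive way the entropy term enters the score are all assumptions made for illustration. The sketch shows the Shannon entropy of a next-token distribution, a combined Top-$K$/Top-$P$ filter for choosing which children to expand, and a generic PUCT-style score with an entropy bonus.

```python
import math
from typing import Dict, List, Tuple


def shannon_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def top_k_top_p(probs: Dict[str, float], k: int, p: float) -> List[Tuple[str, float]]:
    """Keep at most k of the most likely tokens, stopping early once
    their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, prob in ranked[:k]:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def entropy_bonus_score(
    q_value: float,
    prior: float,
    entropy: float,
    parent_visits: int,
    child_visits: int,
    c: float = 1.0,  # hypothetical exploration weight
    w: float = 0.5,  # hypothetical entropy weight
) -> float:
    """Generic PUCT-style score plus an additive entropy bonus.

    ASSUMPTION: this additive form is illustrative only; it is not
    the paper's P-H-UCT rule.
    """
    exploration = c * prior * math.sqrt(math.log(parent_visits + 1)) / (1 + child_visits)
    return q_value + exploration + w * entropy


if __name__ == "__main__":
    # Hypothetical next-token distribution over a few SMILES-like tokens.
    next_token = {"C": 0.5, "N": 0.3, "O": 0.15, "(": 0.05}
    print(top_k_top_p(next_token, k=3, p=0.75))
    print(shannon_entropy(next_token))
```

In this sketch a node whose next-token distribution is more uncertain (higher entropy) receives a larger score, all else being equal, which is one simple way to steer the search toward uncertain regions.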

Abstract

The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMs) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLM's prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.

Paper Structure

This paper contains 41 sections, 13 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of how the ERP algorithm is applied in the Transformer's process for generating molecules. Here, <s> denotes the start token [BOS], and <e> denotes the end token [EOS]. The parts highlighted in red come from the Transformer.
  • Figure 2: An environment with action space $\mathcal{A} = \{\text{left}, \text{right}\}$, in which each node (state) is connected by only two edges (actions), and each edge is associated with a probability of being sampled, as determined by the pretrained Transformer decoder $\pi_{\theta}$. The red values are inferred by the Transformer.
  • Figure 3: Ablation studies. (a)(b)(c)(d)(e) Normalized rewards averaged over valid molecules for different LLMs and algorithms. ERP is our model. PG-TD \cite{zhang2023planning} is the previous state-of-the-art method, as described by Eq. (\ref{eq:P-UCT}), and UCT is described by Eq. (\ref{eq:UCB}). We also sample randomly from the LM as the Sampling baseline. (f) The top 10 leading compounds filtered from the molecules discovered in (e). (g) ERP vs. PG-TD in the number of unique valid molecules on the 3CLPro dataset across different rollouts. (h) Effects of the entropy step $e$ of ERP.
  • Figure 4: The binding sites of the proteins RTCB (PDB ID: 4DWQ) and 3CLPro (PDB ID: 7BQY). The OpenEye software is used to define the binding sites surrounding the crystallized compound \cite{kelley2015posit, liu23Drug}.