REINFORCE-ING Chemical Language Models for Drug Discovery

Morgan Thomas, Albert Bou, Jose Carlos Gómez-Tamayo, Gary Tresadern, Mazen Ahmad, Gianni De Fabritiis

TL;DR

The paper aims to improve sample efficiency and chemical validity when applying reinforcement learning (RL) to chemical language models (CLMs) for de novo drug design. It examines REINFORCE-based learning, introduces a simpler reward-shaping mechanism aligned with a pre-trained prior, and evaluates multiple extensions: baselines, experience replay, hill-climbing, random network distillation (RND), and KL regularization. On the MolOpt benchmark with a 10,000-molecule budget, ACEGEN configurations achieve state-of-the-art effectiveness and efficiency. In a JNK3 case study using Boltz2, they outperform baselines while yielding drug-like, synthesizable molecules and favorable allosteric selectivity. The work provides practical guidelines and open-source tools for applying RL to CLMs in drug discovery.

Abstract

Chemical language models, combined with reinforcement learning (RL), have shown significant promise to efficiently traverse large chemical spaces for drug discovery. However, the performance of various RL algorithms and their best practices for practical drug discovery are still unclear. Here, starting from the principles of the REINFORCE algorithm, we investigate the effect of different components from RL theory including experience replay, hill-climbing, baselines to reduce variance, and alternative reward shaping. We propose a new regularization method more aligned to REINFORCE than current standard practices, and demonstrate how RL hyperparameters can be fine-tuned for effectiveness and efficiency. Lastly, we apply our learnings to practical drug discovery by demonstrating enhanced learning efficiency on frontier binding affinity models by using Boltz2 as a reward model. We share our RL models used in the ACEGEN repository, and hope the experiments here act as a guide to researchers applying RL to chemical language models for drug discovery.
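
For orientation, the textbook REINFORCE gradient estimator with a baseline can be written as below. This is the standard form, not necessarily the notation or exact objective used in the paper; the symbols $\pi_\theta$ (the policy, here the chemical language model), $x$ (a sampled molecule string of length $T$), $R(x)$ (its reward), and $b$ (the baseline) are chosen here for illustration.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \pi_\theta}\left[ \big(R(x) - b\big)\, \nabla_\theta \log \pi_\theta(x) \right], \qquad \log \pi_\theta(x) = \sum_{t=1}^{T} \log \pi_\theta(x_t \mid x_{<t})
$$

A baseline $b$ that does not depend on the sampled $x$ leaves the expectation unchanged, because $\mathbb{E}_{x \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(x)\right] = 0$, while it can reduce the variance of the estimate.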

Paper Structure

This paper contains 18 sections, 8 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Performance of REINFORCE with the proposed reward shaping at different values of $\alpha$ and $\sigma$ on the JNK3 MolOpt benchmark task. The left y-axis (blue) shows the JNK3 score at each training step, while the right y-axis (orange) shows the negative log-likelihood (NLL) of molecules according to the prior policy (lower is more likely, and hence better). Variables are measured during RL training steps until 10,000 molecules have been evaluated. Note that higher $\alpha$ increases learning efficiency and higher $\sigma$ increases the prior likelihood, giving control over the exploitation-regularization trade-off.
  • Figure 2: Effect of a moving average baseline (MAB) and a leave-one-out baseline (LOO) compared to REINFORCE without baselines. Both baselines increase validity and effectiveness at a small cost to exploration.
  • Figure 3: Effect of different top-k on-policy subset values (hill-climbing) compared to REINFORCE (which corresponds to top-k=1). Decreasing the subset size considerably increases effectiveness and efficiency at a smaller relative cost to validity and uniqueness.
  • Figure 4: Effect of different experience replay sampling strategies compared to REINFORCE with no experience replay. The initial letter indicates the sampling type: prioritized proportional to the molecule's reward (P) or uniform (U). The first number after the letter represents the replay batch size, and the final number indicates the experience replay buffer size. Most experience replay strategies increase effectiveness and efficiency at no cost to validity or uniqueness, with smaller replay buffers having a greater effect.
  • Figure 5: Effect of different reward exponent values compared to REINFORCE (which corresponds to a value of 1). Increasing the exponent steepens the return gradient, which increases effectiveness and efficiency at a cost to exploration.
  • ...and 8 more figures
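
The extensions named in the captions above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the ACEGEN implementation: all function and class names are hypothetical, top-k is assumed to be a fraction of the batch (so that 1 keeps every sample, matching the caption's note that REINFORCE corresponds to top-k=1), the reward exponent is assumed to be applied as a simple power, and the replay buffer is assumed to keep the highest-reward molecules.

```python
import random
from collections import deque


def leave_one_out_baseline(rewards):
    """For each sample, the mean reward of the other samples in the batch."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [(total - r) / (n - 1) for r in rewards]


class MovingAverageBaseline:
    """Mean of batch-mean rewards over the last `window` steps (window is illustrative)."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def value(self, default=0.0):
        # Baseline built only from previous steps; `default` is used before any update.
        return sum(self.history) / len(self.history) if self.history else default

    def update(self, rewards):
        self.history.append(sum(rewards) / len(rewards))


def top_k_indices(rewards, fraction):
    """Indices of the highest-reward samples; fraction=1.0 keeps the whole batch."""
    k = max(1, round(fraction * len(rewards)))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return order[:k]


def shape_rewards(rewards, exponent=1.0):
    """Assumed form of the reward exponent: r ** exponent; 1.0 leaves rewards unchanged."""
    return [r ** exponent for r in rewards]


class ReplayBuffer:
    """Keeps the `size` highest-reward molecules seen so far (assumed eviction rule)."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []  # list of (reward, smiles) tuples, best first
        self.rng = random.Random(seed)

    def add(self, smiles, reward):
        self.items.append((reward, smiles))
        self.items.sort(reverse=True)
        del self.items[self.size:]

    def sample(self, k, prioritized=True):
        # Prioritized sampling is proportional to reward and assumes positive rewards;
        # otherwise sampling is uniform. Both sample with replacement.
        if not self.items:
            return []
        weights = [r for r, _ in self.items] if prioritized else None
        return self.rng.choices(self.items, weights=weights, k=k)


def reinforce_loss(log_probs, rewards, baselines):
    """Surrogate loss -mean((R - b) * log pi); its gradient is the REINFORCE estimate."""
    terms = [-(r - b) * lp for lp, r, b in zip(log_probs, rewards, baselines)]
    return sum(terms) / len(terms)
```

In a training step, one would typically shape the rewards, pick a baseline, optionally restrict the batch with `top_k_indices`, mix in molecules from `ReplayBuffer.sample`, and then minimize `reinforce_loss` with an automatic-differentiation framework; the per-molecule log-probabilities would come from the language model.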