Diversity-Aware Reinforcement Learning for de novo Drug Design

Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, Morteza Haghir Chehreghani

TL;DR

The paper tackles mode collapse and limited diversity in RL-driven de novo drug design by proposing a diversity-aware reward framework that combines extrinsic reward penalties with intrinsic rewards. The modified reward takes the linear form $\hat{R}(A) = f(A) \times R(A) + R_I(A)$, where $f(A)$ is the penalty term applied to the extrinsic reward $R(A)$ and $R_I(A)$ is the intrinsic reward; this form is the central handle for balancing exploitation against exploration. The paper evaluates five penalty functions (IMS, ErfIMS, LinIMS, SigIMS, TanhIMS) and eight intrinsic methods (DA, MinDis, MeanDis, MinDisR, MeanDisR, KL-UCB, RND, Inf), as well as two integrated schemes that combine both (TanhRND, TanhInf). Experiments on three targets (GSK3β, JNK3, DRD2) show that combining structure-based penalties with prediction-based exploration generally yields higher diversity in terms of molecular scaffolds, topological scaffolds, and diverse actives, with Inf and TanhInf frequently leading. The results indicate that no single method suffices and that a balanced mix of penalty and intrinsic motivation stabilizes exploration and expands chemical space coverage, potentially improving downstream hit discovery. Overall, the framework advances practical diversity in RL-based drug design and points to future work on integrating domain and agent information into reward shaping.

Abstract

Fine-tuning a pre-trained generative model has demonstrated good performance in generating promising drug molecules. The fine-tuning task is often formulated as a reinforcement learning problem, where previous methods efficiently learn to optimize a reward function to generate potential drug molecules. Nevertheless, in the absence of an adaptive update mechanism for the reward function, the optimization process can become stuck in local optima. The efficacy of the optimal molecule in a local optimization may not translate to usefulness in the subsequent drug optimization process or as a potential standalone clinical candidate. Therefore, it is important to generate a diverse set of promising molecules. Prior work has modified the reward function by penalizing structurally similar molecules, primarily focusing on finding molecules with higher rewards. To date, no study has comprehensively examined how different adaptive update mechanisms for the reward function influence the diversity of generated molecules. In this work, we investigate a wide range of intrinsic motivation methods and strategies to penalize the extrinsic reward, and how they affect the diversity of the set of generated molecules. Our experiments reveal that combining structure- and prediction-based methods generally yields better results in terms of diversity.

Paper Structure

This paper contains 34 sections, 35 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: The proposed diversity-aware RL framework for de novo drug design utilizes extrinsic reward penalty and intrinsic reward to improve the diversity. The RL agent is initialized to the pre-trained prior. The RL agent generates molecules, e.g., in SMILES representation as shown here, and subsequently, the penalty and/or intrinsic reward is used to modify the extrinsic rewards. Each extrinsic reward is multiplied by the corresponding penalty term (equal to one if no penalty is used), while the intrinsic reward (equal to zero if no intrinsic reward is used) is added to the product. The modified rewards are observed by the RL agent and used to update its policy.
  • Figure 2: Evaluation on the GSK3$\beta$ oracle.
  • Figure 3: Evaluation on the JNK3 oracle.
  • Figure 4: Evaluation on the DRD2 oracle.
  • Figure 5: Boxplots of the run time over 20 independent reruns for the different oracles.
  • ...and 3 more figures
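
The reward modification described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `modified_rewards` and its array-based interface are hypothetical, and the penalty and intrinsic values are taken as given inputs rather than computed by any of the methods named in the paper. Only the combination rule itself (multiply each extrinsic reward by its penalty term, then add the intrinsic reward, with defaults of one and zero) comes from the text.

```python
import numpy as np


def modified_rewards(extrinsic, penalty=None, intrinsic=None):
    """Combine rewards as R_hat = f * R + R_I for a batch of molecules.

    Hypothetical helper for illustration only.
    extrinsic: extrinsic rewards R(A), one per generated molecule.
    penalty:   penalty terms f(A); treated as 1 when no penalty is used.
    intrinsic: intrinsic rewards R_I(A); treated as 0 when none is used.
    """
    r = np.asarray(extrinsic, dtype=float)
    # No penalty: every extrinsic reward is multiplied by one.
    f = np.ones_like(r) if penalty is None else np.asarray(penalty, dtype=float)
    # No intrinsic reward: zero is added to the product.
    r_i = np.zeros_like(r) if intrinsic is None else np.asarray(intrinsic, dtype=float)
    return f * r + r_i


# Made-up numbers: two molecules, the first penalized and given a small bonus.
rewards = modified_rewards([0.8, 0.5], penalty=[0.5, 1.0], intrinsic=[0.1, 0.0])
unchanged = modified_rewards([0.8, 0.5])
```

With neither a penalty nor an intrinsic reward supplied, the function returns the extrinsic rewards unchanged, which matches the defaults stated in the caption.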