Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design

Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani

TL;DR

The paper tackles the reward bottleneck in RL for de novo drug design by proposing a diverse mini-batch selection framework: the agent samples a large batch of trajectories, and a small, diverse subset is selected for policy updates using $k$-DPP, MaxMin, or $k$-medoids. It constructs a kernel $L$ from Morgan fingerprint similarity $L_T$ and scaffold-based Dice similarity $L_D$ (with $L = L_T + L_D$) and evaluates four DPP configurations, plus MaxMin and $k$-medoids, across DRD2, GSK3β, and JNK3 tasks within REINVENT. The results show that DPP-based mini-batch diversification enhances both distance-based and reference-based diversity while maintaining competitive rewards, especially when combined with reward-modifying strategies like TanhRND; MaxMin often yields the strongest diversity in actives, whereas $k$-medoids can underperform. The findings suggest that diverse mini-batch learning can mitigate mode collapse and improve exploration, with practical implications for accelerating drug discovery and potentially generalizing to other RL settings with costly evaluations.
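
The selection step described above can be illustrated with a minimal sketch of greedy MaxMin picking. This is not the authors' implementation: it assumes fingerprints are given as sets of on-bit indices, uses only Tanimoto distance rather than the combined kernel $L = L_T + L_D$, and starts from the first candidate, which is an arbitrary choice made for determinism. The names `tanimoto`, `maxmin_select`, and the toy `batch` are hypothetical.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def maxmin_select(fingerprints: List[Set[int]], k: int, first: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly add the candidate farthest from the selected set.

    Assumption: distance = 1 - Tanimoto similarity, and the seed item is
    `first` (the paper's seeding rule is not stated in this summary).
    """
    n = len(fingerprints)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of candidates")
    selected = [first]
    # min_dist[i] holds the distance from candidate i to its nearest selected item.
    min_dist = [1.0 - tanimoto(fingerprints[i], fingerprints[first]) for i in range(n)]
    while len(selected) < k:
        # Choose the unselected candidate whose nearest selected item is farthest away.
        nxt = max((i for i in range(n) if i not in selected), key=lambda i: min_dist[i])
        selected.append(nxt)
        # Update nearest-selected distances with the newly added item.
        for i in range(n):
            d = 1.0 - tanimoto(fingerprints[i], fingerprints[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy "large batch" of hypothetical fingerprints: two near-duplicate pairs and one outlier.
batch = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]
mini_batch = maxmin_select(batch, k=3)  # indices of the items kept for the policy update
print(mini_batch)
```

On this toy batch the sketch keeps one item from each near-duplicate pair plus the outlier, which is the behavior the framework relies on: only the selected subset would then be scored by the costly reward function and used to update the agent.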

Abstract

In many real-world applications, evaluating the quality of instances is costly and time-consuming, e.g., human feedback and physics simulations, in contrast to proposing new instances. In particular, this is even more critical in reinforcement learning, since it relies on interactions with the environment (i.e., new instances) that must be evaluated to provide a reward signal for learning. At the same time, performing sufficient exploration is crucial in reinforcement learning to find high-rewarding solutions, meaning that the agent should observe and learn from a diverse set of experiences to find different solutions. Thus, we argue that learning from a diverse mini-batch of experiences can have a large impact on the exploration and help mitigate mode collapse. In this paper, we introduce mini-batch diversification for reinforcement learning and study this framework in the context of a real-world problem, namely, drug discovery. We extensively evaluate how our proposed framework can enhance the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is crucial. Our experiments demonstrate that our proposed diverse mini-batch selection framework can substantially enhance the diversity of solutions while maintaining high-quality solutions. In drug discovery, such an outcome can potentially lead to fulfilling unmet medical needs faster.

Paper Structure

This paper contains 27 sections, 13 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: We propose a framework for diverse mini-batch selection in reinforcement learning. The RL agent generates a set of experiences in parallel, e.g., trajectories. A kernel measures the pairwise similarities between trajectories and is used to select a diverse set. The selected set is evaluated and, subsequently, is used to update the RL agent.
  • Figure 2: Average extrinsic rewards per generative step across the mini-batch of SMILES evaluated on the DRD2-, GSK3$\beta$-, or JNK3-based reward functions. For clarity of presentation, we display the moving averages with a window size of 101. The average across 10 independent runs per generative step is plotted over 10,000 generative steps, where the shaded area shows standard deviations among the independent runs.
  • Figure 3: Total number of diverse actives after $g$ generative steps evaluated on reward functions based on the DRD2, GSK3$\beta$, or JNK3 predictive model. The total number of diverse actives is plotted for every 250th generative step. The average across 10 independent runs per generative step is plotted over 10,000 generative steps, where the shaded area shows standard deviations among the independent runs.
  • Figure 4: Total number of molecular scaffolds after $g$ generative steps evaluated on reward functions based on the DRD2, GSK3$\beta$, or JNK3 predictive model. The average across 10 independent runs per generative step is plotted over 10,000 generative steps, where the shaded area shows standard deviations among the independent runs.
  • Figure 5: Average extrinsic rewards per generative step across the mini-batch of SMILES evaluated on the DRD2-, GSK3$\beta$-, or JNK3-based reward functions. For clarity of presentation, we display the moving averages with a window size of 101.
  • ...and 9 more figures