Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design
Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
TL;DR
The paper addresses the reward-evaluation bottleneck in RL for de novo drug design with a diverse mini-batch selection framework: sample a large batch of trajectories, then select a small, diverse subset for the policy update using $k$-DPP, MaxMin, or $k$-medoids. The kernel $L = L_T + L_D$ combines Morgan fingerprint similarity $L_T$ with scaffold-based Dice similarity $L_D$. Four DPP configurations, plus MaxMin and $k$-medoids, are evaluated on the DRD2, GSK3β, and JNK3 tasks within REINVENT. The results show that DPP-based mini-batch diversification improves both distance-based and reference-based diversity while maintaining competitive rewards, especially when combined with reward-modifying strategies such as TanhRND; MaxMin often yields the most diverse actives, whereas $k$-medoids can underperform. The findings suggest that diverse mini-batch learning can mitigate mode collapse and improve exploration, with practical implications for accelerating drug discovery and possible generalization to other RL settings with costly evaluations.
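The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each molecule is represented by two hypothetical sets of on-bit indices (one standing in for a Morgan fingerprint of the molecule, one for a fingerprint of its scaffold), builds the summed kernel $L = L_T + L_D$, and runs a standard greedy MaxMin selection. The conversion from kernel value to distance, $2 - L_{ij}$, and the choice of starting index are assumptions made only for this example.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a, b):
    """Dice similarity between two sets of on-bit indices."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def build_kernel(mol_fps, scaffold_fps):
    """Summed kernel L = L_T + L_D; every entry lies in [0, 2]."""
    n = len(mol_fps)
    return [
        [
            tanimoto(mol_fps[i], mol_fps[j]) + dice(scaffold_fps[i], scaffold_fps[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def maxmin_select(kernel, k, start=0):
    """Greedy MaxMin: repeatedly add the item farthest from the selected set.

    Distance is taken as 2 - L[i][j], an assumption made for this sketch.
    """
    n = len(kernel)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of items")
    selected = [start]
    # min_dist[i] is the distance from item i to its nearest selected item.
    min_dist = [2.0 - kernel[start][i] for i in range(n)]
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], 2.0 - kernel[nxt][i])
    return selected


# Toy data (hypothetical): four "molecules", three of which share a scaffold.
mol_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
scaffold_fps = [{1, 2}, {1, 2}, {7, 8}, {1, 2}]

kernel = build_kernel(mol_fps, scaffold_fps)
picked = maxmin_select(kernel, k=3)
print(picked)  # [0, 2, 1]
```

In this toy batch, the molecule with the distinct scaffold (index 2) is picked first after the starting item, and the near-duplicate of the starting molecule (index 3) is left out. In an RL loop of the kind summarized above, only the selected subset would be scored and used for the policy update.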
Abstract
In many real-world applications, evaluating the quality of an instance (e.g., through human feedback or physics simulations) is costly and time-consuming compared with proposing new instances. This cost is especially critical in reinforcement learning, which relies on interactions with the environment (i.e., new instances) that must be evaluated to provide a reward signal for learning. At the same time, sufficient exploration is crucial in reinforcement learning for finding high-rewarding solutions: the agent should observe and learn from a diverse set of experiences in order to find different solutions. We therefore argue that learning from a diverse mini-batch of experiences can have a large impact on exploration and help mitigate mode collapse. In this paper, we introduce mini-batch diversification for reinforcement learning and study this framework in the context of a real-world problem, namely drug discovery. We extensively evaluate how the proposed framework can enhance the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is crucial. Our experiments demonstrate that the proposed diverse mini-batch selection framework can substantially enhance the diversity of solutions while maintaining their quality. In drug discovery, such an outcome can potentially help address unmet medical needs faster.
