
De Novo Molecular Design Enabled by Direct Preference Optimization and Curriculum Learning

Junyu Hou

TL;DR

The paper addresses the inefficiencies and instability of RL-based de novo molecular design by introducing Direct Preference Optimization (DPO) combined with curriculum learning. A two-stage pipeline pretrains a molecular prior and then fine-tunes it using DPO with progressively harder, curriculum-guided preference pairs, enabling efficient, stable optimization without explicit reward models. On the GuacaMol benchmark, the method achieves state-of-the-art performance and substantially faster training than prior approaches, while docking experiments show strong binding potential to multiple target proteins. The results suggest a scalable, data-driven pathway for multi-objective molecular design with practical impact on drug discovery and materials science.

Abstract

De novo molecular design has extensive applications in drug discovery and materials science. The vast chemical space renders direct molecular searches computationally prohibitive, while traditional experimental screening is both time- and labor-intensive. Efficient molecular generation and screening methods are therefore essential for accelerating drug discovery and reducing costs. Although reinforcement learning (RL) has been applied to optimize molecular properties via reward mechanisms, its practical utility is limited by issues in training efficiency, convergence, and stability. To address these challenges, we adopt Direct Preference Optimization (DPO) from NLP, which uses molecular score-based sample pairs to maximize the likelihood difference between high- and low-quality molecules, effectively guiding the model toward better compounds. Moreover, integrating curriculum learning further boosts training efficiency and accelerates convergence. A systematic evaluation of the proposed method on the GuacaMol Benchmark yielded excellent scores. For instance, the method achieved a score of 0.883 on the Perindopril MPO task, representing a 6% improvement over competing models. Subsequent target protein binding experiments confirmed its practical efficacy. These results demonstrate the strong potential of DPO for molecular design tasks and highlight its effectiveness as a robust and efficient solution for data-driven drug discovery.
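To make the preference objective in the abstract concrete, here is a minimal sketch of the standard DPO loss as it is usually written for NLP, applied to one (higher-scoring, lower-scoring) molecule pair. The paper's own equations are not part of this excerpt, so the exact form it uses may differ. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration; the inputs are sequence log-likelihoods under the fine-tuned model and under the frozen pretrained prior.

```python
import math

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All four inputs are log-likelihoods of whole molecules (e.g. SMILES
    strings). "win" is the higher-scoring molecule, "lose" the lower-scoring
    one. `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more the policy favors each molecule than the reference does.
    win_shift = policy_logp_win - ref_logp_win
    lose_shift = policy_logp_lose - ref_logp_lose
    margin = beta * (win_shift - lose_shift)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# With identical policy and reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Raising the winner's likelihood and lowering the loser's reduces the loss.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

Because the loss depends only on likelihood ratios against the prior, no separate reward model has to be trained; the molecular score is needed only to decide which member of each pair is the winner.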

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Structure of the DPO+Curriculum Learning model. The model is initially pre-trained, followed by optimization using Direct Preference Optimization. As curriculum learning progresses, the molecular scores of the collected compounds steadily increase while the distinction between superior and inferior molecules gradually narrows. Ultimately, the process yields molecules that meet the predefined quality criteria.
  • Figure 2: In the GSK3B+DRD2 docking experiment, the model achieved good performance through curriculum learning. In Course 1, the model learns the fundamental requirements of the task. In Course 2, it fine-tunes the molecular scaffold. In Course 3, it adjusts functional groups to optimize molecular structures.
  • Figure 3: Model performance on Ranolazine MPO and Amlodipine MPO tasks under different numbers of agents. (The curve represents the Top-10 score, while the shaded region indicates the score distribution of the top 100 molecules.)
  • Figure 4: Model performance on Perindopril MPO and Amlodipine MPO tasks under different Sampling-to-Training Ratios. (The curve represents the Top-10 score, while the shaded region indicates the score distribution of the top 100 molecules.)
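The caption of Figure 1 notes that the distinction between superior and inferior molecules narrows as the curriculum progresses. One way to picture this is a pair builder whose required score gap shrinks from stage to stage. The sketch below is only an illustration under that reading: the function `build_pairs`, the toy SMILES strings and scores, and the gap schedule are all hypothetical, and it does not model the rising scores of the collected molecules that the caption also mentions. The paper's actual curriculum schedule is not given in this excerpt.

```python
from itertools import combinations

def build_pairs(scored, min_gap):
    """Return (winner, loser) SMILES pairs whose score gap is at least min_gap.

    `scored` is a list of (smiles, score) tuples. Illustrative helper, not
    the paper's implementation.
    """
    pairs = []
    for (smi_a, score_a), (smi_b, score_b) in combinations(scored, 2):
        if abs(score_a - score_b) >= min_gap:
            # The higher-scoring molecule is the preferred one.
            pairs.append((smi_a, smi_b) if score_a > score_b else (smi_b, smi_a))
    return pairs

# Toy molecules with made-up scores in [0, 1].
scored = [("CCO", 0.2), ("c1ccccc1", 0.5), ("CC(=O)O", 0.6), ("CCN", 0.9)]

# Hypothetical schedule: early stages keep only clearly separated pairs,
# later stages admit pairs that are harder to tell apart.
gap_schedule = [0.45, 0.25, 0.05]
stage_pairs = [build_pairs(scored, gap) for gap in gap_schedule]
```

Each stage's pairs would then be fed to the preference loss, so the model first learns coarse distinctions and only later the fine-grained ones.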