MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida

Abstract

Recent advances in reasoning-based large language models (LLMs) have demonstrated substantial improvements in complex problem-solving tasks. Motivated by these advances, several works have explored the application of reasoning LLMs to drug discovery and molecular design. However, most existing approaches either focus on evaluation or rely on training setups that require ground-truth labels, such as molecule pairs with known property modifications. Such supervision is unavailable in \textit{de novo} molecular generation, where the objective is to generate novel molecules that optimize a desirability score without prior knowledge of high-scoring candidates. To bridge this gap, we introduce MolRGen, a large-scale benchmark and dataset for training and evaluating reasoning-based LLMs on \textit{de novo} molecular generation. Our contributions are threefold. First, we propose a setting to evaluate and train models for \textit{de novo} molecular generation and property prediction. Second, we introduce a novel diversity-aware top-$k$ score that captures both the quality and diversity of generated molecules. Third, we show our setting can be used to train LLMs for molecular generation, training a 24B LLM with reinforcement learning, and we provide a detailed analysis of its performance and limitations.

Paper Structure (71 sections, 2 theorems, 14 equations, 16 figures, 5 tables)

Key Result

Proposition A.1

Fix a prompt $q$ and suppose there exist $\hat{o}^\star\in\mathcal{A}$ and $\Delta>0$ such that $A_q(\hat{o}^\star)\ge A_q(\hat{o})+\Delta$ for all $\hat{o}\neq \hat{o}^\star$. Let $\{\pi_t(\cdot\mid q)\}_{t\ge 0}$ be the sequence obtained by repeatedly applying equation eq:exp_update_app with fixed [...] In particular, $\pi_t(\hat{o}^\star\mid q)\to 1$ as $t\to\infty$, and the entropy $H(\pi_t(\cdot\mid q))$ [...]
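The update rule itself is not reproduced in this excerpt, so the following is only a minimal numerical sketch under an assumed form: a multiplicative update $\pi_{t+1}(\hat{o}\mid q)\propto \pi_t(\hat{o}\mid q)\exp(\eta A_q(\hat{o}))$ with a fixed step size $\eta$. The step size, the four-outcome answer set, and the advantage values are all hypothetical. Under that assumption, the ratio between the best outcome and any other grows by at least $\exp(\eta\Delta)$ per step, which is the behaviour the proposition's title describes.

```python
import math

def exp_update(pi, advantages, eta):
    """One assumed multiplicative update: pi'(o) is proportional to pi(o) * exp(eta * A(o))."""
    weights = [p * math.exp(eta * a) for p, a in zip(pi, advantages)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(pi):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)

# Hypothetical toy instance: four outcomes, the first leads the others by a gap of 0.5.
advantages = [1.0, 0.5, 0.2, 0.0]
pi = [0.25, 0.25, 0.25, 0.25]  # uniform starting policy
eta = 1.0                      # assumed fixed step size
steps = 50

# Track (probability of the best outcome, entropy) after every update.
history = [(pi[0], entropy(pi))]
for _ in range(steps):
    pi = exp_update(pi, advantages, eta)
    history.append((pi[0], entropy(pi)))

# Log-ratio between the best and second-best outcome after all updates.
log_ratio = math.log(pi[0] / pi[1])
```

In this toy run the probability of the best outcome never decreases, the entropy shrinks toward zero, and `log_ratio` equals `steps * eta * 0.5` up to rounding, since the starting policy is uniform.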

Figures (16)

  • Figure 1: Diversity-aware top-k score. Evaluation of the diversity-aware top-k score (y-axis) against varying similarity thresholds (x-axis) between candidate clusters.
  • Figure 2: Property prediction performance. Accuracy of the LLMs on classification tasks (left), and normalized Spearman correlation on regression tasks (right).
  • Figure 3: Overview of the target proteins. (a) Functions of the proteins extracted from the PDB. Our dataset comprises 21 molecular functions with at least 10 targets each, with kinases forming the largest group (30%). (b) Annotation score of the proteins on UniProt (from 1 to 5). The vast majority of the target proteins are high-quality entries with strong evidence for their existence.
  • Figure 4: Task sizes in the molecular property prediction objectives. The vast majority of tasks are regression tasks, and the largest benchmark used is TDC.
  • Figure 5: Scaffold occurrence in the various benchmarks. Occurrences of the most frequent Murcko scaffolds (of at least 6 atoms) in each benchmark, illustrating the chemical diversity across tasks.
  • ...and 11 more figures
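
The definition of the diversity-aware top-k score from Figure 1 is not given in this excerpt. As a rough illustration of how such a score can depend on a similarity threshold, the sketch below uses one possible greedy formulation, which is an assumption and not necessarily the authors' method: candidates are visited in decreasing score order, and a candidate is kept only if its Tanimoto similarity to every candidate already kept is below the threshold. The function names, the feature-set representation of molecules, and the toy scores are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diverse_top_k(candidates, k, threshold):
    """Mean score of up to k candidates picked greedily by score.

    A candidate is skipped when its similarity to an already picked one
    reaches the threshold. Unfilled slots count as zero, so a redundant
    candidate set is penalized.
    """
    picked = []
    for score, feats in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(tanimoto(feats, other) < threshold for _, other in picked):
            picked.append((score, feats))
        if len(picked) == k:
            break
    return sum(score for score, _ in picked) / k

# Hypothetical candidates: (desirability score, set of fingerprint bits).
candidates = [
    (0.90, frozenset({1, 2, 3, 4})),
    (0.85, frozenset({1, 2, 3, 5})),     # similarity 0.6 to the first candidate
    (0.60, frozenset({7, 8, 9})),        # shares no bits with the others
    (0.30, frozenset({1, 2, 3, 4, 5})),
]

loose = diverse_top_k(candidates, k=2, threshold=1.0)   # near-duplicates allowed
strict = diverse_top_k(candidates, k=2, threshold=0.5)  # near-duplicates skipped
```

With the loose threshold the two best-scoring but similar candidates are both kept, giving 0.875. With the strict threshold the second one is skipped in favour of the dissimilar candidate, giving 0.75.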

Theorems & Definitions (2)

  • Proposition A.1: Exponential amplification of probability ratios
  • Lemma A.2: Coverage from a cluster-mass floor