Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

Baoheng Zhu; Deyu Bo; Delvin Ce Zhang; Xiao Wang

TL;DR

This paper proposes Graph-GRPO, an online reinforcement learning (RL) framework for training graph flow models (GFMs) under verifiable rewards. It achieves state-of-the-art performance on molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

Abstract

Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, also known as the graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph and then regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0\% and 97.5\% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.
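The perturbation step of the refinement strategy can be illustrated with a minimal sketch. It assumes the linear interpolation path commonly used in discrete flow matching, $p_t = t\,\delta_{z_1} + (1-t)\,p_0$, under which each entry of a clean graph is kept with probability $t_{\epsilon}$ and otherwise redrawn from the prior. The function name `renoise`, the uniform prior, and the toy node labels are hypothetical and are not taken from the paper.

```python
import numpy as np

def renoise(z1, t_eps, prior, rng):
    """Re-noise a clean categorical state z1 back to time t_eps.

    Assumption: linear interpolation path, so each entry is kept with
    probability t_eps and otherwise resampled from the prior p0.
    Returns the noisy state and a mask of the perturbed entries.
    """
    z1 = np.asarray(z1)
    keep = rng.random(z1.shape) < t_eps
    resampled = rng.choice(len(prior), size=z1.shape, p=prior)
    return np.where(keep, z1, resampled), ~keep

rng = np.random.default_rng(0)
node_types = np.array([0, 2, 1, 1, 3])   # toy node labels (hypothetical)
prior = np.full(4, 0.25)                 # uniform prior over 4 node types
z_t, perturbed = renoise(node_types, t_eps=0.8, prior=prior, rng=rng)
```

Only the entries flagged in `perturbed` need to be regenerated by the model; the rest of the graph is left as generated, which is what makes the exploration local. The same routine would apply to an array of edge types.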

Paper Structure

This paper contains 48 sections, 1 theorem, 29 equations, 9 figures, 7 tables, and 2 algorithms.

Introduction
Preliminary
Discrete Flow Matching
Training and Sampling
The Proposed Method
Estimation of Rate Matrix
Graph-GRPO
Refinement
Experiment
Experimental Setup
General Graph Generation
Protein Docking
Target Property Optimization
Ablation Study
Visualization
...and 33 more sections

Key Result

Proposition 3.1

Given the current state $z_t$, current time $t$, prior distribution $p_0$, and model prediction $p_{\theta}(\cdot|z_t)$, the analytic rate matrix is defined in closed form in terms of $V_1$ and $V_2$, two statistics that can be pre-calculated before generation.
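To show why an analytic rate matrix is useful for RL, here is a minimal sketch of the generic Euler step of a continuous-time Markov chain, which turns one row of a rate matrix into explicit transition probabilities: $p(z_{t+\Delta t}=j \mid z_t=i) = \delta_{ij} + R_t(i,j)\,\Delta t$. This is not the paper's formula for the rate matrix, which is not reproduced here. The function name `step_probs` and the rate values are hypothetical.

```python
import numpy as np

def step_probs(rates, current, dt):
    """One Euler step of a continuous-time Markov chain.

    rates[j] is the rate of jumping from `current` to state j (j != current).
    The entry at `current` is ignored and replaced by the stay probability.
    """
    probs = np.asarray(rates, dtype=float) * dt
    probs[current] = 0.0
    probs[current] = 1.0 - probs.sum()
    if probs[current] < 0:
        raise ValueError("dt is too large for these rates")
    return probs

rates = np.array([0.0, 0.5, 1.5, 0.0])   # hypothetical off-diagonal rates
probs = step_probs(rates, current=0, dt=0.02)
log_p_next = np.log(probs[2])            # log-probability of moving to state 2
```

Because every transition has an explicit probability, the log-probability of a sampled next state can be evaluated directly under both the old and the new policy, which is what a policy-ratio objective needs.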

Figures (9)

Figure 1: Reward curves on two molecular optimization tasks. We use DeFoG as the base model. As the number of oracle calls increases, the score of RL-optimized models gradually rises, while the base model remains almost unchanged. In tasks with highly selective rewards, e.g., Valsartan SMARTS, refining promising candidates is more effective than de novo generation.
Figure 2: The overall framework of Graph-GRPO. (1) Rollout: given a noisy graph, the policy model samples $K$ denoising trajectories and caches the graph state $z_{t+\Delta t}$ along with its transition probability $p_{t+\Delta t}^{\text{old}}$. In the meantime, we normalize the reward scores to calculate the advantage of each trajectory. (2) RL training: we select a graph $G_t$ from the rollouts and use the new policy model to estimate the transition probabilities $p_{t+\Delta t}$. Graph-GRPO maximizes the rewards by optimizing the advantage-weighted ratio in Eq. \ref{eq:grpo}.
Figure 3: Refinement in Graph-GRPO. We first use GFMs to denoise a noisy graph from $t=0$ to $t=1$. Subsequently, we re-noise the generated graph to time step $t_{\epsilon}$ and denoise it again using GFMs. This strategy explicitly increases the number of denoising steps of GFMs and improves the generation quality of Graph-GRPO.
Figure 4: Visualization of the sampling trajectory. From left to right, the graphs $G_t$ move from a prior distribution ($t=0$) to the data distribution ($t=1$). Visualizations of the predicted clean state $\hat{z}_1$ at the corresponding time steps are provided in Figure \ref{fig:vis_pred_z1}.
Figure 5: Visualization of training and evaluation curves in Graph-GRPO. We select five representative tasks from the PMO benchmark. (a) Training reward curves plotted against training steps. (b) Average top-10 scores plotted against the number of oracle calls. For brevity, we only visualize the first 5,000 oracle calls.
...and 4 more figures
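The two stages described in the Figure 2 caption, reward normalization during rollout and an advantage-weighted probability ratio during training, can be sketched as follows. This is a minimal illustration that assumes the standard GRPO form: group-wise mean and standard deviation normalization, and a PPO-style clipped ratio. The clipping range, the omission of any KL term, and all numeric values are assumptions, not details confirmed by this summary.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of K trajectories sampled from the same start."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip=0.2):
    """Advantage-weighted probability ratio with PPO-style clipping (assumed)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip, 1 + clip) * adv
    return np.minimum(unclipped, clipped).mean()

rewards = [0.1, 0.4, 0.4, 0.9]                        # hypothetical oracle scores
adv = group_advantages(rewards)
logp_old = np.log([0.03, 0.05, 0.02, 0.04])           # cached transition log-probs
objective = clipped_surrogate(logp_old, logp_old, adv)  # ratio is 1 before any update
```

In training, `logp_new` would come from the updated policy evaluated on the cached states, and the objective would be maximized by gradient ascent in an automatic differentiation framework; NumPy is used here only to keep the arithmetic visible.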

Theorems & Definitions (2)

Proposition 3.1
proof
