Graph Diffusion Policy Optimization

Yijing Liu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, Wei Chen

TL;DR

GDPO optimizes graph diffusion models for arbitrary objectives using reinforcement learning. Experimental results show that it achieves state-of-the-art performance in various graph generation tasks with complex and diverse objectives.

Abstract

Recent research has made significant progress in optimizing diffusion models for downstream objectives, which is an important pursuit in fields such as graph generation for drug design. However, directly applying these models to graphs presents challenges, resulting in suboptimal performance. This paper introduces graph diffusion policy optimization (GDPO), a novel approach to optimize graph diffusion models for arbitrary (e.g., non-differentiable) objectives using reinforcement learning. GDPO is based on an eager policy gradient tailored for graph diffusion models, which was developed through careful analysis and promises improved performance. Experimental results show that GDPO achieves state-of-the-art performance in various graph generation tasks with complex and diverse objectives. Code is available at https://github.com/sail-sg/GDPO.

Paper Structure (24 sections, 15 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Overview of GDPO. (1) In each optimization step, GDPO samples multiple generation trajectories from the current Graph DPM and queries the reward function with different $\bm{G}_0$. (2) For each trajectory, GDPO accumulates the gradient $\nabla_\theta \log p_\theta(\bm{G}_0|\bm{G}_t)$ of each $(\bm{G}_0, \bm{G}_t)$ pair and assigns a weight to the aggregated gradient based on the corresponding reward signal. Finally, GDPO estimates the eager policy gradient by averaging the aggregated gradient from all trajectories.
  • Figure 2: Toy experiment comparing DDPO and GDPO. We generate connected graphs with an increasing number of nodes. Node categories are disregarded, and the edge categories are binary, indicating whether two nodes are linked. The graph DPM is initialized randomly as a one-layer graph transformer from DiGress (Vignac et al., 2022). The number of diffusion steps $T$ is set to $50$, and the reward signal $r(\bm{G}_0)$ is defined as $1$ if $\bm{G}_0$ is connected and $0$ otherwise. We use $256$ trajectories for gradient estimation in each update. The learning curves show that the performance of DDPO degrades as the number of nodes increases, while GDPO consistently performs well.
  • Figure 3: We investigate two key factors of GDPO on ZINC250k, with the target protein being 5ht1b. Similarly, the vertical axis represents the total queries, while the horizontal axis represents the average reward. (a) We vary the number of trajectories for gradient estimation. (b) We fix the weight of $r_{\textsc{QED}}$ and $r_{\textsc{SA}}$, and change the weight of $r_{\textsc{NOV}}$ while ensuring the total weight is 1.
  • Figure 4: Graph Diffusion Policy Optimization
  • Figure 5: We investigate the L2 distance between two consecutive steps in two types of DPMs. The number of diffusion steps is 1000 for both models.
  • ...and 2 more figures
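
The captions of Figures 1 and 2 describe two pieces that can be illustrated with a small sketch: the binary connectivity reward used in the toy experiment, and the way per-step gradients are accumulated, weighted by the reward, and averaged over trajectories. The Python below is only a minimal illustration under stated assumptions, not the authors' implementation. The function names (`is_connected`, `connectivity_reward`, `eager_policy_gradient`) are hypothetical, graphs are assumed to be plain 0/1 adjacency matrices, and the per-step gradients $\nabla_\theta \log p_\theta(\bm{G}_0|\bm{G}_t)$ are assumed to be already available as lists of floats. Any reward normalization or timestep subsampling used in the actual method is omitted.

```python
from collections import deque


def is_connected(adj):
    """Return True if the undirected graph with 0/1 adjacency matrix `adj` is connected."""
    n = len(adj)
    if n == 0:
        return False  # assumption: an empty graph earns no reward
    seen = {0}
    queue = deque([0])
    while queue:  # breadth-first search from node 0
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


def connectivity_reward(adj):
    """Toy reward from the Figure 2 caption: 1 if G_0 is connected, 0 otherwise."""
    return 1.0 if is_connected(adj) else 0.0


def eager_policy_gradient(step_grads, rewards):
    """Average of reward-weighted, per-trajectory accumulated gradients.

    step_grads[k][t] is assumed to hold the gradient of log p(G_0 | G_t)
    for trajectory k at step t, as a list of floats (hypothetical layout).
    rewards[k] is the reward of the final graph G_0 of trajectory k.
    """
    num_traj = len(step_grads)
    dim = len(step_grads[0][0])
    estimate = [0.0] * dim
    for grads, reward in zip(step_grads, rewards):
        # Accumulate the gradient over all (G_0, G_t) pairs of this trajectory.
        aggregated = [sum(g[i] for g in grads) for i in range(dim)]
        # Weight by the trajectory's reward and average over trajectories.
        for i in range(dim):
            estimate[i] += reward * aggregated[i] / num_traj
    return estimate


if __name__ == "__main__":
    path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # 0 - 1 - 2
    split_graph = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # node 2 is isolated
    rewards = [connectivity_reward(path_graph), connectivity_reward(split_graph)]
    grads = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]]]
    print(rewards, eager_policy_gradient(grads, rewards))
```

Because the reward only multiplies the accumulated gradient, the reward function never needs to be differentiated, which is consistent with the abstract's point that GDPO handles non-differentiable objectives.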