Pessimistic Backward Policy for GFlowNets

Hyosoon Jang; Yunhui Jang; Minsu Kim; Jinkyoo Park; Sungsoo Ahn

TL;DR

This work proposes a pessimistic backward policy for GFlowNets (PBP-GFN), which maximizes the observed flow so that it aligns closely with the true reward of each object, and evaluates it extensively across eight benchmarks.

Abstract

This paper studies Generative Flow Networks (GFlowNets), which learn to sample objects proportionally to a given reward function through a trajectory of state transitions. In this work, we observe that GFlowNets tend to under-exploit high-reward objects because they are trained on an insufficient number of trajectories, which may lead to a large gap between the estimated flow and the (known) reward value. In response to this challenge, we propose a pessimistic backward policy for GFlowNets (PBP-GFN), which maximizes the observed flow so that it aligns closely with the true reward of the object. We extensively evaluate PBP-GFN across eight benchmarks, including a hyper-grid environment, bag generation, structured set generation, molecular generation, and four RNA sequence generation tasks. In particular, PBP-GFN enhances the discovery of high-reward objects, maintains the diversity of the objects, and consistently outperforms existing methods.

Paper Structure (16 sections, 8 equations, 12 figures, 1 algorithm)

Introduction
Preliminaries
Method
Motivation: under-exploitation of objects with partially observed trajectories
Pessimistic backward policy for GFlowNets
Related work
Experiment
Synthetic tasks
Molecular generation
Sequence generation
Ablation studies
Conclusion
Proof for error bound
Experimental details
Exploration-exploitation trade-off
...and 1 more sections

Figures (12)

Figure 1: Flow matching for observed trajectories. (a) The task aims to reach the terminal state with a reward-proportional probability from the initial state, by incrementing one coordinate as a random action. The black line indicates the two observed trajectories for each terminal state. (b-c) The arrow ($\rightarrow$) length indicates the amount of the backward or forward flow. In (b), the flow matching ($\approx$) between the observed backward and forward flows underestimates the high-reward object due to the low observed backward flow. In (c), PBP-GFN succeeds with the observed backward flow that fully represents the true rewards.
Figure 2: Under-exploitation of objects with partially observed trajectories. The reward $R(x)$ consists of (1) observed backward flow $R_\mathcal{B}(x)$ and (2) unobserved backward flow $R(x)-R_\mathcal{B}(x)$. (a) Conventional flow matching may assign a higher probability to the lower-reward object as the observed forward flow is aligned only with a small amount of observed backward flow. This fails to assign the accurate probability proportional to the reward. (b) PBP-GFN assigns more accurate probability proportional to the reward, by increasing the proportion of observed flow.
Figure 3: Pessimistic backward policy for GFlowNets (PBP-GFN). The filled portion of each circle indicates the amount of flow, e.g., $\CIRCLE$ indicates a flow of 1, and $\RIGHTcircle$ indicates half the flow of $\CIRCLE$, i.e., a flow of 0.5. Additionally, the color of a flow matches the color of the reward it induces, and the black and gray lines indicate the observed and unobserved trajectories, respectively. (a) Flow matching succeeds when all trajectories are observed. One can read off from the amount of flow that the true reward of $x_1$ is 1 and the reward of $x_2$ is 0.5. (b) Flow matching fails with partially observed trajectories. (c) PBP-GFN assigns high probabilities to the backward transitions of observed trajectories so that high-reward objects keep a high probability.
Figure 4: The target distribution and empirical distributions of each model trained with $10^5$ trajectories. The empirical distributions are computed as rescaled products of the distribution over three runs. Our method (PBP-GFN) consistently discovers all modes over three runs and learns the target Boltzmann distribution correctly within a relatively small number of trajectories.
Figure 5: Performance comparison with prior backward policy design methods. The solid line and shaded region represent the mean and standard deviation, respectively. PBP-GFN generates more diverse high-reward objects than the considered baselines for designing the backward policy.
...and 7 more figures
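
The decomposition in the Figure 2 and Figure 3 captions can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions, not the authors' implementation: the rewards (1 for $x_1$, 0.5 for $x_2$) come from the Figure 3 caption, while the trajectory counts, the trajectory labels, and the helper names `observed_backward_flow` and `sampling_probs` are hypothetical. The pessimistic policy is modeled in its idealized limit, where all backward probability sits on observed trajectories.

```python
from fractions import Fraction

# Rewards taken from the Figure 3 caption.
rewards = {"x1": Fraction(1), "x2": Fraction(1, 2)}

# Hypothetical trajectory sets: x1 is reachable by four trajectories,
# x2 by two, and only one trajectory per object has been observed.
trajectories = {"x1": ["a", "b", "c", "d"], "x2": ["e", "f"]}
observed = {"x1": ["a"], "x2": ["e"]}


def uniform_backward(x):
    """Backward policy spreading probability evenly over all trajectories of x."""
    n = len(trajectories[x])
    return {t: Fraction(1, n) for t in trajectories[x]}


def pessimistic_backward(x):
    """Idealized pessimistic policy: all backward mass on observed trajectories."""
    n_obs = len(observed[x])
    return {
        t: Fraction(1, n_obs) if t in observed[x] else Fraction(0)
        for t in trajectories[x]
    }


def observed_backward_flow(x, backward_policy):
    """R_B(x): the share of R(x) routed through observed trajectories."""
    probs = backward_policy(x)
    return rewards[x] * sum(probs[t] for t in observed[x])


def sampling_probs(flows):
    """Normalize flows into the distribution that flow matching would fit."""
    total = sum(flows.values())
    return {x: f / total for x, f in flows.items()}


uniform_flows = {x: observed_backward_flow(x, uniform_backward) for x in rewards}
pessimistic_flows = {
    x: observed_backward_flow(x, pessimistic_backward) for x in rewards
}

target_probs = sampling_probs(rewards)
uniform_probs = sampling_probs(uniform_flows)
pessimistic_probs = sampling_probs(pessimistic_flows)

print("target     ", target_probs)
print("uniform    ", uniform_probs)
print("pessimistic", pessimistic_probs)
```

Under these assumptions the uniform backward policy leaves both objects with the same observed flow, so matching it would sample $x_1$ and $x_2$ equally often, whereas the pessimistic policy recovers the reward-proportional target.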

Theorems & Definitions (1)

Example 1