Evaluating GFlowNet from partial episodes for stable and flexible policy-based training

Puhua Niu, Shili Wu, Xiaoning Qian

TL;DR

This work bridges value-based and policy-based GFlowNet training by showing that flow balance also yields a principled policy evaluator that measures the policy divergence, and it proposes an evaluation balance objective over partial episodes for learning that evaluator.

Abstract

Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
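
To make "balance over partial episodes" concrete, the sketch below computes the log-space residual of the sub-trajectory balance condition known from earlier GFlowNet work, $F(s_m)\prod_{i=m}^{n-1} P_F(s_{i+1}\mid s_i) = F(s_n)\prod_{i=m}^{n-1} P_B(s_i\mid s_{i+1})$. It illustrates the value-based idea only and is not the evaluation balance (Sub-EB) objective proposed in this paper; the function names and toy numbers are hypothetical.

```python
import math


def sub_tb_residual(log_flow_start, log_flow_end, log_pf_steps, log_pb_steps):
    """Log-space residual of sub-trajectory balance for one partial episode.

    log_flow_start: log F(s_m), estimated flow at the first state.
    log_flow_end:   log F(s_n), estimated flow at the last state.
    log_pf_steps:   log P_F(s_{i+1} | s_i) for each transition.
    log_pb_steps:   log P_B(s_i | s_{i+1}) for each transition.
    """
    if len(log_pf_steps) != len(log_pb_steps):
        raise ValueError("forward and backward step lists must have equal length")
    forward = log_flow_start + sum(log_pf_steps)
    backward = log_flow_end + sum(log_pb_steps)
    return forward - backward


def sub_tb_loss(log_flow_start, log_flow_end, log_pf_steps, log_pb_steps):
    """Squared residual, the usual per-sub-trajectory training loss."""
    r = sub_tb_residual(log_flow_start, log_flow_end, log_pf_steps, log_pb_steps)
    return r * r


if __name__ == "__main__":
    # Toy partial episode with two transitions (made-up numbers):
    # F(s_m) = 4, P_F = 0.5 at each step, F(s_n) = 1, P_B = 1 at each step.
    residual = sub_tb_residual(
        math.log(4.0), math.log(1.0),
        [math.log(0.5), math.log(0.5)],
        [math.log(1.0), math.log(1.0)],
    )
    print(f"residual = {residual:.3e}")
```

The residual is zero exactly when the forward and backward products agree, so minimizing its square over many sampled sub-trajectories pushes the estimated flows and the forward policy toward balance.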

Paper Structure (46 sections, 4 theorems, 45 equations, 17 figures, 5 tables, 4 algorithms)

Key Result

Theorem 3.1

Suppose $V$ is an arbitrary evaluation function over $\mathcal{S}$, and $F^\ast$ is the optimal flow induced by a backward policy $\pi_B$. Given a forward policy $\pi_F$, if and only if $V$ satisfies the Sub-EB condition (Sub-EB).

Figures (17)

  • Figure 1: A graphical illustration of a DAG (left) and its graded version (right). Dotted circles represent dummy states, added during the conversion to a graded DAG.
  • Figure 2: Plots of the means and standard deviations (represented by the shaded area) of $D_{\mathrm{TV}}$ for different training methods with parameterized $\pi_B$ and uniform $\pi_B$ on the $256 \times 256$ (left), $128 \times 128$ (middle), and $64\times 64 \times 64$ (right) grids, based on five randomly started runs for each method. By default, metric values are recorded every 20 iterations over $N = 2000$ training iterations and smoothed by a sliding window of length 5 for all plotted curves in this paper.
  • Figure 3: Plots of the means and standard deviations (represented by the shaded area) of average reward (left), diversity (middle), and FCS (right) of the top 100 unique candidate graphs over 10 nodes, based on five randomly started runs for each method.
  • Figure 4: Plots of the means and standard deviations (represented by the shaded area) of $D_{\mathrm{JSD}}$ for different training methods on the $256\times 256$ (left), $128\times 128$ (middle), and $64 \times 64 \times 64$ (right) grids, based on five randomly started runs for each method.
  • Figure 5: Plots of the means and standard deviations (represented by the shaded area) of $D_{\mathrm{TV}}$ (right) for different training methods on the $128\times 128$ (left) and $20\times 20$ (right) grids, based on five randomly started runs for Sub-TB-16, Sub-TB-128, Q-much-16, and Q-much-128. Here, "16" and "128" denote the training batch sizes.
  • ...and 12 more figures
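
The captions above report $D_{\mathrm{TV}}$ and $D_{\mathrm{JSD}}$ curves that are recorded every 20 iterations and smoothed by a sliding window of length 5. The sketch below shows one common way to compute such quantities for discrete distributions, using the standard definitions of total variation and Jensen-Shannon divergence. The trailing-mean smoothing rule and all function names are assumptions for illustration; the page does not specify how the paper's curves were actually smoothed.

```python
import math


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def sliding_mean(values, window=5):
    """Trailing moving average (assumed rule); early points use a shorter window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


if __name__ == "__main__":
    target = [0.5, 0.5]
    learned = [1.0, 0.0]
    print(total_variation(target, learned))
    print(jensen_shannon(target, learned))
    # Recording every 20 iterations over 2000 iterations gives 100 points.
    print(len(range(0, 2000, 20)))
```

Both divergences are zero when the learned terminal-state distribution matches the target, which is why they serve as training-progress metrics on grids small enough to enumerate.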

Theorems & Definitions (8)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • proof
  • proof
  • proof
  • Corollary A.1: Corollary to Theorem \ref{eb-the}
  • proof