Order-Preserving GFlowNets

Yihang Chen, Lukas Mauch

TL;DR

OP-GFNs address a core limitation of traditional GFlowNets: rather than relying on a predefined scalar reward, they learn a reward that preserves a given (partial) order over candidates. Training combines an order-preserving loss with MDP constraint losses, and the learned reward landscape becomes sparser around the top candidates as training progresses, which favors exploration early and exploitation later. Theoretical results show that the learned reward concentrates on candidates ranked higher in the ordering, and experiments on HyperGrid, molecular design, and neural architecture search (NAS) demonstrate state-of-the-art performance in both single-objective maximization and multi-objective Pareto front approximation, without requiring scalarization. The approach offers a practical way to generate diverse candidates when the objective is expensive to evaluate or not directly available as a scalar, as is common in automated design tasks.
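The order-preserving idea can be illustrated with a minimal sketch. It assumes one concrete form of the loss (an assumption, not taken from this page): on a batch of candidates, the target distribution is uniform over the maximal (non-dominated) elements, and the loss is the cross-entropy between that target and the distribution $R(x)/\sum_{x'}R(x')$ induced by the learned reward. For two candidates with $u(x)<u(x')$ this reduces to $-\log\big(R(x')/(R(x)+R(x'))\big)$. The function names are hypothetical, and the MDP constraint loss that would be trained alongside it is omitted.

```python
import math
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b (maximisation): >= everywhere, > somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(objectives: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the maximal elements of the batch under the Pareto partial order."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)]


def order_preserving_loss(log_rewards: Sequence[float],
                          objectives: Sequence[Sequence[float]]) -> float:
    """Hypothetical OP loss: cross-entropy between a uniform target on the maximal
    elements and the softmax of the learned log-rewards, R(x) / sum_x' R(x')."""
    m = max(log_rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in log_rewards))  # stable log-sum-exp
    best = non_dominated(objectives)
    return -sum(log_rewards[i] - log_z for i in best) / len(best)


# Total order (one objective): the loss pushes log R(x') above log R(x).
print(order_preserving_loss([0.0, 0.0], [(0.0,), (1.0,)]))  # equals log 2 when rewards are tied
# Partial order (two objectives): (1,0), (0,1), (0.5,0.5) are maximal, (0,0) is not.
print(non_dominated([(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.0, 0.0)]))  # → [0, 1, 2]
```

With a single objective and distinct utilities, exactly one candidate in the batch is maximal; with several objectives the target spreads its mass over the whole non-dominated set, so no scalarization of the objectives is needed.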

Abstract

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates with probabilities proportional to a given reward. However, GFlowNets can only be used with a predefined scalar reward, which may be computationally expensive to evaluate or, as in multi-objective optimization (MOO) tasks, not directly accessible. Moreover, to prioritize high-reward candidates, the conventional practice is to raise the reward to a higher exponent, whose optimal value may vary across environments. To address these issues, we propose Order-Preserving GFlowNets (OP-GFNs), which sample with probabilities proportional to a learned reward function that is consistent with a provided (partial) order on the candidates, eliminating the need for an explicit formulation of the reward function. We prove theoretically that the training process of OP-GFNs gradually sparsifies the learned reward landscape in single-objective maximization tasks. The sparsification concentrates on candidates ranked higher in the ordering, ensuring exploration at the beginning of training and exploitation towards the end. We demonstrate the state-of-the-art performance of OP-GFNs in single-objective maximization (totally ordered) and multi-objective Pareto front approximation (partially ordered) tasks, including synthetic datasets, molecule generation, and neural architecture search.
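To see why the choice of exponent matters, the sketch below computes the distribution a perfectly trained GFlowNet would sample from when the reward $R(x)$ is replaced by $R(x)^\beta$, over a toy set of three candidates. The function name and the reward values are illustrative assumptions; $\beta$ plays the role of the exponent mentioned in the abstract.

```python
import math
from typing import List, Sequence


def tempered_distribution(rewards: Sequence[float], beta: float) -> List[float]:
    """Target distribution p(x) proportional to R(x)**beta (hypothetical helper)."""
    logits = [beta * math.log(r) for r in rewards]  # beta * log R, kept in log space
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]     # stable softmax numerator
    z = sum(weights)
    return [w / z for w in weights]


rewards = [1.0, 2.0, 4.0]
for beta in (1.0, 4.0, 16.0):
    print(beta, [round(p, 3) for p in tempered_distribution(rewards, beta)])
```

As $\beta$ grows, almost all probability mass moves to the highest-reward candidate, trading diversity for exploitation. How large $\beta$ should be depends on the reward scale of each environment, which is the tuning burden OP-GFNs aim to remove by learning the reward from the ordering alone.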

Paper Structure

This paper contains 91 sections, 6 theorems, 46 equations, 21 figures, 6 tables, and 1 algorithm.

Key Result

Proposition 1

For $\{x_i\}_{i=0}^n\subseteq \mathcal{X}$, assume that $u(x_i)< u(x_j)$ for $0\leq i< j\leq n$. The order-preserving reward $\widehat{R}(x)\in[1/\gamma,1]$ is defined as the reward function that minimizes the order-preserving loss for neighboring pairs, $\mathcal{L}_{\rm OP-N}$, i.e., $\widehat{R}=\arg\min_{R:\,\mathcal{X}\to[1/\gamma,1]}\mathcal{L}_{\rm OP-N}(\{x_i\}_{i=0}^n; R)$. Then $\widehat{R}(x_i) = \gamma^{i/n-1}$ for $0\leq i\leq n$, and $\mathcal{L}_{\rm OP-N}(\{x_i\}_{i=0}^n; \widehat{R})=n\log(1+\gamma^{-1/n})$.
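As a sanity check on Proposition 1, the snippet below evaluates the neighboring-pair loss at the closed-form reward and compares it with random feasible rewards. It assumes (this page does not define it) that $\mathcal{L}_{\rm OP-N}$ sums the pairwise term $-\log\big(R(x_i)/(R(x_{i-1})+R(x_i))\big)$ over $i=1,\dots,n$. Under that assumption each term equals $\log\big(1+R(x_{i-1})/R(x_i)\big)$, a convex, decreasing function of the log-ratio $\log R(x_i)-\log R(x_{i-1})$. The log-ratios sum to at most $\log\gamma$ because $R\in[1/\gamma,1]$, so by Jensen's inequality the loss is minimized by splitting $\log\gamma$ evenly, which gives the constant ratio $\gamma^{1/n}$ and the value $n\log(1+\gamma^{-1/n})$. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def op_neighbor_loss(rewards: Sequence[float]) -> float:
    """Assumed L_OP-N for candidates listed in increasing utility u(x_0) < ... < u(x_n):
    sum_i -log(R(x_i) / (R(x_{i-1}) + R(x_i)))."""
    return sum(-math.log(rewards[i] / (rewards[i - 1] + rewards[i]))
               for i in range(1, len(rewards)))


def closed_form_rewards(n: int, gamma: float) -> List[float]:
    """R_hat(x_i) = gamma**(i/n - 1): runs from 1/gamma to 1 with constant ratio gamma**(1/n)."""
    return [gamma ** (i / n - 1) for i in range(n + 1)]


n, gamma = 5, 10.0
r_hat = closed_form_rewards(n, gamma)
loss_hat = op_neighbor_loss(r_hat)
predicted = n * math.log(1 + gamma ** (-1 / n))
print(loss_hat, predicted)  # the two values agree up to floating-point rounding

# Random search over feasible rewards in [1/gamma, 1]: none should beat R_hat.
rng = random.Random(0)
worst_gap = min(
    op_neighbor_loss([rng.uniform(1 / gamma, 1.0) for _ in range(n + 1)]) - loss_hat
    for _ in range(5000)
)
print("smallest gap over random rewards:", worst_gap)
```

Changing `n` or `gamma` shows how the spacing $\gamma^{1/n}$ tightens as more candidates must fit into the fixed range $[1/\gamma, 1]$.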

Figures (21)

  • Figure 4.1: Molecular design: In the environments Bag, QM9, sEH, TFBind8, and TFBind10, we compare our algorithm (OP-TB) against previous GFN methods (MaxEnt, TB, DB, subTB, GTB) and (RL-)sampling methods (MARS, A2C, SQL, PPO).
  • Figure 4.2: Multi-trial training of a GFlowNet sampler. Best test accuracy at epochs 12 and 200 for the random baseline (Random), GFlowNet methods (TB, OP-TB, OP-TB-KL, OP-TB-KL-AUG), and other multi-trial algorithms (REA, BOHB, REINFORCE).
  • Figure 5.1: Reward Distribution: We plot the indicator function of the true Pareto front solutions and the learned reward distribution of the OP-GFNs and PC-GFNs.
  • Figure 5.2: Fragment-Based Molecule Generation: We plot the estimated Pareto front of the generated samples in $[0,1]^2$. The $x$- and $y$-axes are, respectively, the first and second objectives named in each subfigure's title.
  • Figure E.1: HyperGrid with $D=2$, $H=64$, and $R_0\in\{0.1, 0.01, 0.001\}$. $P_B$ is either trainable with KL regularization weight $\lambda_{\rm KL}\in\{0, 0.01, 0.1, 1, 10\}$ or fixed.
  • ...and 16 more figures
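Figure E.1 varies the HyperGrid background reward $R_0$. The sketch below assumes the standard HyperGrid reward from the original GFlowNet work, with mode weights $R_1 = 0.5$ and $R_2 = 2$ (a common default; this page does not state the values used here), and measures how much of the total reward sits on the highest-reward cells of a $D=2$, $H=64$ grid. The function names are hypothetical.

```python
import itertools
from typing import Sequence


def hypergrid_reward(x: Sequence[int], H: int, R0: float,
                     R1: float = 0.5, R2: float = 2.0) -> float:
    """Assumed standard HyperGrid reward: background R0, plus R1 where every coordinate
    lies in the outer band, plus R2 on a thinner ring inside it (the modes)."""
    ax = [abs(xd / (H - 1) - 0.5) for xd in x]  # distance of each coordinate from the centre
    outer = all(0.25 < a <= 0.5 for a in ax)
    ring = all(0.3 < a < 0.4 for a in ax)
    return R0 + R1 * outer + R2 * ring


def mode_mass(D: int, H: int, R0: float) -> float:
    """Fraction of the total reward carried by the highest-reward cells."""
    rewards = [hypergrid_reward(x, H, R0) for x in itertools.product(range(H), repeat=D)]
    top = max(rewards)
    return sum(r for r in rewards if r == top) / sum(rewards)


for R0 in (0.1, 0.01, 0.001):
    print(R0, mode_mass(D=2, H=64, R0=R0))
```

Lowering $R_0$ makes the cells between modes nearly reward-free, so the landscape becomes sparser and the modes harder to reach by exploration, which is why smaller $R_0$ is generally treated as a harder setting.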

Theorems & Definitions (11)

  • Proposition 1: Mutually different
  • Proposition 2: Informal
  • Definition 1: Sequence prepend/append MDP
  • Proposition 3
  • Proposition 4: Mutually different
  • Proof of Proposition 4
  • Proposition 5
  • Proof of Proposition 5
  • Remark
  • Proposition 6
  • ...and 1 more