Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization

Timofei Gritsaev, Nikita Morozov, Sergey Samsonov, Daniil Tiapkin

TL;DR

This work addresses the limitation of fixed backward policies in GFlowNets by introducing Trajectory Likelihood Maximization (TLM), a principled backward-policy optimization that alternates between maximizing the backward trajectory likelihood and optimizing the forward policy under an entropy-regularized RL objective with non-stationary rewards. The method integrates with existing GFlowNet algorithms (e.g., TB, DB, SubTB, SoftDQN) and provides convergence guarantees under stability and diminishing non-stationary regret. Empirically, TLM accelerates convergence and improves mode discovery across Hypergrid, Bit Sequences, and QM9+sEH molecule design tasks, though benefits can vary with environment structure. Overall, TLM offers a versatile, easy-to-implement improvement to backward policy optimization, with strong performance in less-structured domains and actionable stability strategies for training.

Abstract

Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.
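The backward-policy step can be illustrated on a toy problem. What follows is a minimal sketch, not the authors' implementation: it assumes a hypothetical four-edge DAG, a fixed tabular forward policy (`FORWARD`), and softmax backward logits, and it takes exact gradient-ascent steps on the expected backward log-likelihood of forward-sampled trajectories. All names and numbers are illustrative, and the forward-policy update that the method alternates with is omitted.

```python
import math

# Hypothetical toy DAG: s0 -> {a, b}, a -> {x}, b -> {x, y}; x and y are terminal.
# The forward policy is held fixed here purely for illustration.
FORWARD = {
    "s0": {"a": 0.8, "b": 0.2},
    "a": {"x": 1.0},
    "b": {"x": 0.5, "y": 0.5},
}


def trajectories(state="s0", prob=1.0, prefix=("s0",)):
    """Enumerate (trajectory, forward probability) pairs by depth-first search."""
    if state not in FORWARD:  # terminal state reached
        yield prefix, prob
        return
    for child, p in FORWARD[state].items():
        yield from trajectories(child, prob * p, prefix + (child,))


def softmax(logits):
    """Softmax over a dict of logits, shifted by the max for numerical stability."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def backward_likelihood_step(logits, lr=0.5):
    """One gradient-ascent step on the expected log-likelihood of backward transitions."""
    grads = {child: {p: 0.0 for p in parents} for child, parents in logits.items()}
    for traj, weight in trajectories():
        for parent, child in zip(traj[:-1], traj[1:]):
            probs = softmax(logits[child])
            for p in logits[child]:
                # Gradient of log softmax: indicator of the taken parent minus its probability.
                grads[child][p] += weight * ((p == parent) - probs[p])
    for child in logits:
        for p in logits[child]:
            logits[child][p] += lr * grads[child][p]


# Backward logits: one entry per (child, parent) edge, initialized to uniform.
logits = {
    "a": {"s0": 0.0},
    "b": {"s0": 0.0},
    "x": {"a": 0.0, "b": 0.0},
    "y": {"b": 0.0},
}
for _ in range(2000):
    backward_likelihood_step(logits)

p_b_x = softmax(logits["x"])  # learned backward distribution over the parents of x
```

In this toy setting the learned backward policy at `x` approaches the posterior over parents induced by the forward policy (probability 0.8 / 0.9 for parent `a`), which is the sense in which the likelihood step ties the backward policy to the current forward policy. Because the forward policy's RL reward depends on the backward policy, each such step changes the reward the forward learner sees, which is why the TL;DR speaks of non-stationary rewards.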

Paper Structure

This paper contains 23 sections, 1 theorem, 18 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Assume that (1) the backward updates are stable, i.e., $\sup_{t \geq 0} \lVert\mathcal{P}_{\mathrm{B}}^T - \mathcal{P}_{\mathrm{B}}^{T+t}\rVert_1 \to 0$ as $T \to \infty$, and (2) the forward updates follow a non-stationary regret minimization algorithm, i.e., $\overline{\mathfrak{R}}^T \to 0$ as $T

Figures (6)

  • Figure 1: $L^1$ distance between target and empirical sample distributions over the course of training on the standard (top row) and hard (bottom row) hypergrid environments for each method. Lower values indicate better performance.
  • Figure 2: Top row: Bit Sequences, the number of discovered modes out of a total of 60 modes for different methods. Center row: QM9, the number of Tanimoto-separated modes with reward greater than or equal to $1.125$ for different methods. Bottom row: sEH, the number of Tanimoto-separated modes with reward greater than or equal to $0.875$ for different methods. Higher values indicate better performance. For each pair of a GFlowNet algorithm and a backward approach, results are shown for the learning rate that yields the largest total number of discovered modes.
  • Figure 3: Top row: Bit Sequences, Spearman correlation between $\mathcal{R}$ and $P_\theta$ on a test set for different methods and varying learning rate $\in \{ 5 \cdot 10^{-4}, 10^{-3}, 2 \cdot 10^{-3} \}$. Center row: QM9, Pearson correlation between $\log \mathcal{R}$ and $\log P_\theta$ on the fixed subset of the QM9 dataset (Ramakrishnan et al., 2014) for different methods and varying learning rate $\in \{ 5 \cdot 10^{-5}, 10^{-4}, 5 \cdot 10^{-4}, 10^{-3} \}$. Bottom row: sEH, Pearson correlation between $\log \mathcal{R}$ and $\log P_\theta$ on the test set from Bengio et al. (2021) for different methods and varying learning rate $\in \{ 5 \cdot 10^{-5}, 10^{-4}, 5 \cdot 10^{-4}, 10^{-3} \}$. Higher values indicate better performance. Note that the pessimistic backward policy can be very sensitive to the choice of learning rate.
  • Figure 4: Ablation study of stability techniques on QM9. The number of Tanimoto-separated modes with a reward of at least $1.125$ is shown. As a base algorithm, we use DB with a learning rate of $5 \cdot 10^{-4}$.
  • Figure 5: Bit Sequences, the number of modes discovered over the course of training for different methods and a learning rate $\in \{ 5 \cdot 10^{-4}, 10^{-3}, 2 \cdot 10^{-3} \}$. Some curves for the learning rates of $10^{-3}$ and $2 \cdot 10^{-3}$ are incomplete because gradients exploded at certain points in training.
  • ...and 1 more figure
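
The Figure 1 metric, the $L^1$ distance between the target and empirical sample distributions, can be sketched as follows. This is an illustrative helper under stated assumptions rather than the paper's evaluation code: the name `l1_distance` and the toy rewards are hypothetical, the target is taken to be the reward normalized over all terminal objects, every sample is assumed to be one of those objects, and no further normalization of the distance is applied.

```python
from collections import Counter


def l1_distance(rewards, samples):
    """L1 distance between the target R(x) / Z and the empirical sample frequencies.

    rewards: dict mapping each terminal object to its (positive) reward.
    samples: list of sampled terminal objects, each assumed to be a key of rewards.
    """
    total = sum(rewards.values())  # partition function Z
    counts = Counter(samples)
    n = len(samples)
    return sum(abs(r / total - counts[x] / n) for x, r in rewards.items())


# Hypothetical toy example: the target distribution is (0.75, 0.25).
rewards = {"x": 3.0, "y": 1.0}
mismatched = l1_distance(rewards, ["x", "x", "y", "y"])  # empirical (0.5, 0.5)
matched = l1_distance(rewards, ["x", "x", "x", "y"])     # empirical (0.75, 0.25)
```

A sampler whose empirical frequencies match the reward-proportional target gives a distance of zero, which is why lower curves in Figure 1 indicate better performance.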

Theorems & Definitions (2)

  • Theorem 3.1
  • Proof of Theorem 3.1