Generative Flow Networks as Entropy-Regularized RL

Daniil Tiapkin, Nikita Morozov, Alexey Naumov, Dmitry Vetrov

TL;DR

This work establishes a direct reduction of Generative Flow Networks (GFlowNets) to entropy-regularized reinforcement learning (MaxEnt RL) for general DAGs, showing that with a fixed backward policy and appropriately structured rewards the optimal soft RL policy coincides with the GFlowNet forward policy. It then demonstrates that existing soft RL algorithms, notably SoftDQN and Munchausen DQN, can be ported to train GFlowNets, and it interprets the classic TB/DB/SubTB objectives through a soft RL lens. Empirically, Munchausen DQN often matches or surpasses traditional GFlowNet methods on synthetic hypergrid tasks, small molecule generation, and non-autoregressive sequence generation, which supports the practical viability of RL-based GFlowNet training. The results suggest a unifying perspective in which RL principles provide a flexible and scalable toolkit for diverse GFlowNet applications, with potential for further theoretical and algorithmic cross-pollination such as MCTS-inspired approaches.

Abstract

The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating RL principles into the realm of generative flow networks.

Paper Structure

This paper contains 49 sections, 2 theorems, 36 equations, 8 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{G} = (\mathcal{S}, \mathcal{E})$ be a DAG with a set of terminal states $\mathcal{X}$, let $\mathcal{P}_{\mathrm{B}}$ be a fixed backward policy, and let $\mathcal{R}$ be a GFlowNet reward function. Let $\mathcal{M}_{\mathcal{G}} = (\mathcal{S}', \mathcal{A}, \mathrm{P}, r, \gamma, s_0)$ be … Then the optimal policy $\pi^\star_1(s'|s)$ for the regularized MDP with coefficient $\lambda=1$ is …
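
The claim that the optimal soft RL policy coincides with the GFlowNet forward policy can be checked numerically on a small graph. The sketch below is illustrative only and is not the authors' code. The MDP construction is not spelled out in this summary, so the sketch assumes deterministic transitions along DAG edges, $\gamma = 1$, reward $\log \mathcal{P}_{\mathrm{B}}(s|s')$ on every edge $s \to s'$, and reward $\log \mathcal{R}(x)$ for leaving a terminal state $x$. The toy graph, the reward values, and all function names are hypothetical.

```python
import math

# Hypothetical toy DAG: children[s] lists the states reachable from s in one step.
children = {"s0": ["a", "b"], "a": ["x1", "x2"], "b": ["x2"], "x1": [], "x2": []}
reward = {"x1": 1.0, "x2": 3.0}           # GFlowNet rewards R(x) on terminal states
topo_order = ["s0", "a", "b", "x1", "x2"]  # parents always precede children

parents = {s: [] for s in children}
for s, kids in children.items():
    for c in kids:
        parents[c].append(s)


def p_b(parent, child):
    """Fixed uniform backward policy P_B(parent | child)."""
    return 1.0 / len(parents[child])


def flow_forward_policy():
    """Forward policy F(s -> s') / F(s) of the flow induced by R and P_B."""
    flow = dict(reward)  # F(x) = R(x) on terminal states
    edge_flow = {}
    for s in reversed(topo_order):
        if children[s]:  # non-terminal: state flow is the sum of outgoing edge flows
            flow[s] = sum(edge_flow[(s, c)] for c in children[s])
        for p in parents[s]:  # split F(s) over incoming edges according to P_B
            edge_flow[(p, s)] = flow[s] * p_b(p, s)
    policy = {(s, c): edge_flow[(s, c)] / flow[s] for (s, c) in edge_flow}
    return policy, flow


def soft_optimal_policy():
    """Soft (entropy-regularized, lambda = 1) optimal policy by backward induction.

    Assumed rewards: log P_B(s | s') on edges, log R(x) when exiting terminal x.
    """
    value = {x: math.log(r) for x, r in reward.items()}  # V(x) = log R(x)
    q = {}
    for s in reversed(topo_order):
        if not children[s]:
            continue
        for c in children[s]:
            q[(s, c)] = math.log(p_b(s, c)) + value[c]  # Q = r + V(next), gamma = 1
        # Soft Bellman backup: V(s) = log sum_{s'} exp Q(s, s')
        value[s] = math.log(sum(math.exp(q[(s, c)]) for c in children[s]))
    policy = {(s, c): math.exp(q[(s, c)] - value[s]) for (s, c) in q}
    return policy, value


pf, flow = flow_forward_policy()
pi, value = soft_optimal_policy()
for edge in pf:
    print(edge, round(pf[edge], 6), round(pi[edge], 6))
```

On this toy graph the two policies agree edge by edge, and the soft value at $s_0$ equals the log of the total flow $\log \sum_x \mathcal{R}(x)$. This is consistent with the reduction described above, under the assumed reward structure.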

Figures (8)

  • Figure 1: $L^1$ distance between target and empirical GFlowNet distributions over the course of training on the hypergrid environment. Top row: $\mathcal{P}_{\mathrm{B}}$ is fixed to be uniform for all methods. Bottom row: $\mathcal{P}_{\mathrm{B}}$ is learnt for the baselines and fixed to be uniform for M-DQN. Mean and std values are computed over 3 runs.
  • Figure 2: Small molecule generation results. Above: Pearson correlation between $\log \mathcal{R}$ and $\log \mathcal{P}_{\theta}$ on a test set for each method and varying $\beta \in \{4, 8, 10, 16\}$. Solid lines represent the best results over choices of learning rate; dashed lines represent mean results. Below: Number of Tanimoto-separated modes with $\tilde{\mathcal{R}} > 7.0$ found over the course of training for $\beta = 10$.
  • Figure 3: Bit sequence generation results. Above: Spearman correlation between $\mathcal{R}$ and $\mathcal{P}_{\theta}$ on a test set for each method and varying $k \in \{2, 4, 6, 8, 10\}$. Below: The number of modes discovered over the course of training for $k = 8$.
  • Figure 4: $L^1$ distance between target and empirical GFlowNet distributions over the course of training on the hypergrid environment for the hard reward variant. Top row: $\mathcal{P}_{\mathrm{B}}$ is fixed to be uniform for all methods. Bottom row: $\mathcal{P}_{\mathrm{B}}$ is learnt for the baselines and fixed to be uniform for M-DQN.
  • Figure 5: $L^1$ distance between target and empirical GFlowNet distributions over the course of training on the hypergrid environment. Two versions of M-DQN are shown: one with a fixed uniform $\mathcal{P}_{\mathrm{B}}$ and one with $\mathcal{P}_{\mathrm{B}}$ learnt during training.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Proposition 1