Maximum entropy GFlowNets with soft Q-learning

Sobhan Mohammadpour; Emmanuel Bengio; Emma Frejinger; Pierre-Luc Bacon

TL;DR

This work builds a bridge between entropy-regularized reinforcement learning and Generative Flow Networks by designing a reward that yields sampling proportional to an unnormalized target $\tilde{p}$ under the soft Bellman equations. It introduces generative soft Q-learning (GSQL) and the maximum entropy GFN (max-ent GFN), where the backward policy is $q(s,a|s')=\frac{n(s)}{n(s')}$ and entropy is maximized over feasible flows, guaranteeing the maximum achievable flow entropy in general. The authors show that $\log n$ can be learned via the inverted MDP and that PCL and trajectory/balance constraints align under this framework, yielding a unique, high-entropy solution. Empirically, max-ent GFNs improve exploration and mode coverage on structured MDPs, including tree- and graph-building tasks like sEH and QM9, while GSQL may fail on larger combinatorial spaces. The results highlight the practical viability of leveraging entropy-regularized RL tools for GFNs and point to broad applicability in combinatorial sampling and molecule design.
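As a rough illustration of the backward policy $q(s,a|s')=\frac{n(s)}{n(s')}$, the sketch below assumes that $n(s)$ is the number of trajectories from the initial state to $s$ (the reading suggested by the Figure 6 caption) and that each pair of states is joined by at most one action, so the action can be dropped from the notation. The DAG, the state names, and the helper functions are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy DAG (not from the paper): edges point from parent to child.
EDGES = [("s0", "a"), ("s0", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("c", "d")]
ORDER = ["s0", "a", "b", "c", "d"]  # a topological order of the states


def count_trajectories(edges, order, root="s0"):
    """n(s): number of distinct paths from the root to each state."""
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
    n = {root: 1}
    for s in order:
        if s != root:
            # Every path into s passes through exactly one parent of s.
            n[s] = sum(n[p] for p in parents[s])
    return n, parents


def maxent_backward(n, parents):
    """Assumed form q(s | s') = n(s) / n(s') for every parent s of s'."""
    return {child: {p: n[p] / n[child] for p in ps} for child, ps in parents.items()}


n, parents = count_trajectories(EDGES, ORDER)
q = maxent_backward(n, parents)
```

Under these assumptions, the probability of walking any single trajectory backward from $s'$ telescopes to $n(s_0)/n(s') = 1/n(s')$, so every trajectory reaching $s'$ is equally likely.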

Abstract

Generative Flow Networks (GFNs) have emerged as a powerful tool for sampling discrete objects from unnormalized distributions, offering a scalable alternative to Markov Chain Monte Carlo (MCMC) methods. While GFNs draw inspiration from maximum entropy reinforcement learning (RL), the connection between the two has largely been unclear and seemingly applicable only in specific cases. This paper addresses the connection by constructing an appropriate reward function, thereby establishing an exact relationship between GFNs and maximum entropy RL. This construction allows us to introduce maximum entropy GFNs, which, in contrast to GFNs with uniform backward policy, achieve the maximum entropy attainable by GFNs without constraints on the state space.
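To make the contrast with the uniform backward policy concrete, the following sketch compares the entropy of the backward trajectory distribution into a single state under two choices: a uniform backward policy and a count-based one that weights each parent by its number of incoming trajectories. The toy DAG and all function names are hypothetical, and the quantity computed is only the trajectory entropy conditioned on one end state, not the paper's full flow entropy or any of its experiments.

```python
import math

# Hypothetical toy DAG (not from the paper), given as child -> list of parents.
PARENTS = {"a": ["s0"], "b": ["s0"], "c": ["a", "b"], "d": ["a", "c"]}


def num_paths(s):
    """Number of paths from s0 to s."""
    return 1 if s == "s0" else sum(num_paths(p) for p in PARENTS[s])


def trajectory_probs(s, backward):
    """Probability of each backward trajectory from s down to s0."""
    if s == "s0":
        return [1.0]
    probs = []
    for p in PARENTS[s]:
        probs += [backward(p, s) * pr for pr in trajectory_probs(p, backward)]
    return probs


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uniform(p, s):
    """Uniform backward policy: every parent of s is equally likely."""
    return 1.0 / len(PARENTS[s])


def count_based(p, s):
    """Count-based backward policy: parents weighted by their path counts."""
    return num_paths(p) / num_paths(s)


h_uniform = entropy(trajectory_probs("d", uniform))
h_counts = entropy(trajectory_probs("d", count_based))
```

On this toy graph three trajectories reach `d`. The count-based policy spreads probability evenly over them and attains the upper bound $\log 3$, whereas the uniform backward policy puts probability $1/2$ on the short trajectory and yields a strictly lower entropy.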

Paper Structure (23 sections, 36 equations, 5 figures, 10 tables, 1 algorithm)

INTRODUCTION
BACKGROUND AND NOTATION
GENERATIVE FLOW NETWORKS
SOFT Q-LEARNING
FROM SOFT Q-LEARNING TO MAXIMUM ENTROPY GFNs
A different definition of flow entropy
The backward of GSQL
Remarks on PCL
EXPERIMENTS
A simple MDP
Hypergrid
Molecule design
CONCLUSION
GENERATIVE FLOW NETWORKS
Multiple solutions for GFNs.
...and 8 more sections

Figures (5)

Figure 1: Comparison of the maximum entropy and uniform backward policies. Left: uniform backward policy; middle: MDP; right: maximum entropy GFlowNet. The numbers are the policy probabilities at states $s_0$ and $s_2$.
Figure 2: From left to right, target, $l=$, and marginal of the uniform backward and maximum entropy GFNs for the $64^2$ grid. Note the log scale colors for $\mu$ and the non-smooth partitioning of the flow around the bottom and left edges with the uniform backward policy.
Figure 4: Experiment statistics. Confidence intervals show the IQM. From top to bottom, the rows belong to the sEH and QM9 experiments.
Figure 5: MSE of the learned $n$ and the ground truth in the sEH experiments.
Figure 6: Proof that the uniform backward is not maximum entropy on tree-building environments. Left and right are the parents of the middle tree. Assuming nodes are unique, the left tree has 12 trajectories that reach it, while the right tree has 8.

Theorems & Definitions (20)

proof
proof
proof
definition 1
proof
definition 2
proof
proof
proof
definition 3
...and 10 more
