Avoid What You Know: Divergent Trajectory Balance for GFlowNets

Pedro Dall'Antonia; Tiago da Silva; Daniel Csillag; Salem Lahlou; Diego Mesquita

TL;DR

This work proposes Adaptive Complementary Exploration (ACE), a principled algorithm for effectively exploring novel, high-probability regions when learning GFlowNets. ACE introduces an exploration GFlowNet explicitly trained to search for high-reward states in regions underexplored by the canonical GFlowNet, which learns to sample from the target distribution.

Abstract

Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. However, learning efficiency is constrained by the model's ability to rapidly explore diverse high-probability regions during training. To mitigate this issue, recent works have focused on incentivizing the exploration of unvisited and valuable states via curiosity-driven search and self-supervised random network distillation, which tend to waste samples on already well-approximated regions of the state space. In this context, we propose Adaptive Complementary Exploration (ACE), a principled algorithm for the effective exploration of novel and high-probability regions when learning GFlowNets. To achieve this, ACE introduces an exploration GFlowNet explicitly trained to search for high-reward states in regions underexplored by the canonical GFlowNet, which learns to sample from the target distribution. Through extensive experiments, we show that ACE significantly improves upon prior work in terms of approximation accuracy to the target distribution and discovery rate of diverse high-reward states.
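For readers new to GFlowNets, the following is a minimal sketch of the standard trajectory balance (TB) objective, on which the divergent variant named in the title builds. It is not the paper's DTB loss. It assumes the forward and backward log-probabilities along one complete trajectory have already been computed; the function name `tb_loss` and its arguments are hypothetical.

```python
import math


def tb_loss(log_z, log_pf, log_pb, log_reward, beta=1.0):
    """Squared trajectory balance residual for a single complete trajectory.

    log_z: scalar estimate of log Z.
    log_pf: forward log-probabilities log P_F(s_{t+1} | s_t), one per transition.
    log_pb: backward log-probabilities log P_B(s_t | s_{t+1}), one per transition.
    log_reward: log R(x) of the terminal state x.
    beta: inverse temperature; the target is proportional to R(x)^beta.
    """
    if len(log_pf) != len(log_pb):
        raise ValueError("log_pf and log_pb must have one entry per transition")
    # TB asks Z * prod P_F = R(x)^beta * prod P_B; in log space the residual is:
    residual = log_z + sum(log_pf) - beta * log_reward - sum(log_pb)
    return residual ** 2


# Toy check: Z = 2, two forward steps of probability 0.5, deterministic
# backward steps and R(x) = 0.5 satisfy the balance, so the loss is (numerically) zero.
loss = tb_loss(math.log(2.0), [math.log(0.5), math.log(0.5)], [0.0, 0.0], math.log(0.5))
```

Working in log space keeps the products of many small transition probabilities numerically stable, which is the usual reason TB is written this way.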

Paper Structure (17 sections, 4 theorems, 43 equations, 10 figures, 1 algorithm)

Introduction
Preliminaries and Related Works
GFlowNets
Learning GFlowNets
Adaptive Complementary Exploration
Experiments
Discussion
Proofs
Proof of Proposition 3.4 (Complementary Sampling Property)
Proof of Proposition 3.5 (Repulsive Bound)
Proof of Proposition 3.8
Proof of Proposition 3.9 (Equilibrium State)
Related works
Experimental details
Optimization & Architecture
...and 2 more sections

Key Result

Proposition 3.4

Assume $\mathcal{L}_{\nabla}(\mathfrak{g}_{\nabla}; \tau, \alpha) = 0$ for each trajectory $\tau$ starting at $s_{o}$ and finishing at a terminal state in $\mathcal{X}$, and that $\mathrm{UA}(\alpha) \neq \emptyset$. Then, the marginal $p_{\top}^{\nabla}$ of $p_{F}^{\nabla}$, with normalizing constant $Z_{\nabla} = \sum_{x \in \mathrm{UA}(\alpha)} R(x)^{\beta}$, …
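To make the normalizing constant concrete, here is a small sketch over a finite set of terminal states. It assumes the under-allocated set UA(α) is already given as a boolean mask (Definition 3.1 is not reproduced here), and it reads the proposition as saying that the exploratory marginal equals R(x)^β on UA(α), divided by Z_∇, and zero elsewhere. Both that reading and the helper name `complementary_target` are our assumptions.

```python
def complementary_target(rewards, ua_mask, beta=1.0):
    """Hypothetical helper: tempered rewards restricted to UA(alpha), normalized by Z_nabla.

    rewards: list of R(x) > 0, one per terminal state x.
    ua_mask: list of booleans, True where x is assumed to lie in UA(alpha).
    beta: inverse temperature applied to the reward.
    """
    if len(rewards) != len(ua_mask):
        raise ValueError("rewards and ua_mask must have the same length")
    # Z_nabla = sum over x in UA(alpha) of R(x)^beta, as stated in Proposition 3.4.
    z_nabla = sum(r ** beta for r, in_ua in zip(rewards, ua_mask) if in_ua)
    if z_nabla <= 0:
        raise ValueError("UA(alpha) must be non-empty and carry positive reward")
    # Assumed reading: states outside UA(alpha) receive zero mass.
    return [r ** beta / z_nabla if in_ua else 0.0 for r, in_ua in zip(rewards, ua_mask)]


# Four terminal states, two of them assumed under-allocated, beta = 2:
# Z_nabla = 1^2 + 3^2 = 10, so the mass splits 0.1 / 0.9 over UA(alpha).
probs = complementary_target([1.0, 2.0, 3.0, 4.0], [True, False, True, False], beta=2.0)
```

Under this reading, the exploration GFlowNet ignores the states the canonical sampler already covers well and redistributes all of its mass, in proportion to the tempered reward, over the under-allocated ones.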

Figures (10)

Figure 1: When the exploratory policy is learned from a combination of intrinsic and extrinsic rewards (see the corresponding equations in the main text), the model may overemphasize well-learned states (bottom row). In contrast, ACE avoids sampling trajectories from over-explored regions of the state space by design (top row), which improves mode discovery and accelerates convergence to the target (rightmost panel). The panels show the marginal distributions of the forward and exploratory policies at different points in training (marked by dashed vertical lines).
Figure 2: A GFlowNet trained on the Rings distribution (left) via $\epsilon$-greedy exploration may overdraw samples from a well-approximated region (polygon), misrepresenting other high-probability regions. The trajectory balance (TB) residuals in the rightmost panel, for the inner (top) and outer (bottom) rings, show that ACE avoids this issue.
Figure 3: ACE significantly accelerates mode discovery for autoregressive sequence generation with GFlowNets. Each plot shows the average reward of the 200 highest-valued unique states discovered so far as a function of the number of trajectories sampled during training.
Figure 4: ACE finds diverse, high-reward states faster than prior GFlowNet exploration methods on the bag generation (left) and quadratic knapsack (right) problems. In both (a) and (b), $K$ denotes the number of items available for selection.
Figure 5: ACE converges faster than prior GFlowNet exploration methods on the Lazy Random Walk task. See the hypergrid figure below for the legend.
...and 5 more figures
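The mode-discovery metric described for Figure 3 (the average reward of the 200 highest-valued unique states discovered so far) can be computed as in the sketch below. It assumes states are hashable and that each sampled state arrives paired with its reward; the function name, the `k` parameter, and the choice to average over fewer than `k` states when fewer have been discovered are our own.

```python
def top_k_unique_mean_reward(samples, k=200):
    """Mean reward of the k highest-reward unique states among (state, reward) pairs.

    samples: iterable of (state, reward) pairs; states must be hashable.
    k: number of top unique states to average (200 in Figure 3).
    """
    best = {}
    for state, reward in samples:
        # Deduplicate states, keeping the largest reward seen for each one.
        best[state] = max(reward, best.get(state, reward))
    if not best:
        raise ValueError("no states discovered yet")
    # Assumption: if fewer than k unique states exist, average over all of them.
    top = sorted(best.values(), reverse=True)[:k]
    return sum(top) / len(top)


discovered = [("a", 1.0), ("b", 3.0), ("a", 1.0), ("c", 2.0)]
score_top2 = top_k_unique_mean_reward(discovered, k=2)
score_all = top_k_unique_mean_reward(discovered)
```

Deduplicating before ranking is what makes the metric reward diversity: resampling the same high-reward state does not raise the score.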

Theorems & Definitions (9)

Definition 3.1: Over- & Under-Allocated regions
Definition 3.2: DTB
Definition 3.3: DTB Loss
Proposition 3.4: Complementary Sampling Property
Proposition 3.5: Repulsive Bound
Definition 3.6: Canonical Loss
Remark 3.7: Notation for the loss functions
Proposition 3.8
Proposition 3.9: Equilibrium State
