Discrete Probabilistic Inference as Control in Multi-path Environments

Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua Bengio

TL;DR

This work addresses the challenge of sampling from discrete, structured distributions by casting probabilistic inference as a finite-horizon sequential decision problem. It connects Generative Flow Networks (GFlowNets) and Maximum Entropy RL (MaxEnt RL) through a reward-correction framework that ensures the terminating-state distribution matches the target Gibbs distribution $P(x) \propto \exp(-{\mathcal{E}}(x)/\alpha)$. The authors establish formal equivalences between core objectives across the two paradigms, including Path Consistency Learning (PCL) with Subtrajectory Balance (SubTB), Soft Q-Learning (SQL) with Modified Detailed Balance (DB), and Forward-Looking DB, and they verify these connections empirically on factor-graph inference, Bayesian structure learning, and phylogenetic tree generation. The results support a unified, flow-based approach to probabilistic inference in large discrete spaces, so that RL techniques can be applied to inference tasks and combinatorial generation problems.

Abstract

We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

Paper Structure

This paper contains 33 sections, 9 theorems, 44 equations, 8 figures, 2 tables.

Key Result

Theorem 3.1

Let $P_{B}(\cdot\mid s')$ be an arbitrary backward transition probability (i.e., a distribution over the parents of $s'\neq s_{0}$ in ${\mathcal{G}}$). Let $r(s, s')$ be the reward function of the MDP corrected with $P_{B}$, satisfying for any trajectory $\tau = (s_{0}, s_{1}, \ldots, s_{T}, s_{f})$ where we used the convention $s_{T+1} = s_{f}$. Then the terminating state distribution associated
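
The statement above is cut off in this listing and its displayed condition is missing. The block below is a hedged sketch, not a quotation of the theorem: it writes down one condition on the corrected reward that is consistent with the target Gibbs distribution named in the TL;DR, and shows why it would remove the multi-path bias. It assumes deterministic transitions, so that the optimal MaxEnt RL policy induces a trajectory distribution proportional to the exponentiated return divided by $\alpha$, and it assumes the trajectory terminates in $x = s_{T}$.

```latex
% Assumed form of the correction (reconstructed, not copied from the paper):
\begin{align}
\sum_{t=0}^{T} r(s_{t}, s_{t+1})
  &= -{\mathcal{E}}(s_{T}) + \alpha \sum_{t=0}^{T-1} \log P_{B}(s_{t} \mid s_{t+1}) \\
% Trajectory distribution of the optimal MaxEnt policy under that return:
\pi^{*}(\tau)
  &\propto \exp\!\Big(\tfrac{1}{\alpha} \sum_{t=0}^{T} r(s_{t}, s_{t+1})\Big)
   = \exp\!\big(-{\mathcal{E}}(s_{T})/\alpha\big) \prod_{t=0}^{T-1} P_{B}(s_{t} \mid s_{t+1}) \\
% Marginalizing over all trajectories that terminate in x:
\pi^{*}(x)
  &= \sum_{\tau \,:\, s_{T} = x} \pi^{*}(\tau)
   \propto \exp\!\big(-{\mathcal{E}}(x)/\alpha\big)
     \underbrace{\sum_{\tau \,:\, s_{T} = x} \prod_{t=0}^{T-1} P_{B}(s_{t} \mid s_{t+1})}_{=\,1}
   = \exp\!\big(-{\mathcal{E}}(x)/\alpha\big)
\end{align}
```

The underbraced sum equals 1 because $P_{B}$ defines a probability distribution over backward paths from $x$ to $s_{0}$, so the number of trajectories reaching $x$ no longer inflates its probability.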

Figures (8)

  • Figure 1: Illustration of the bias of the terminating state distribution associated with $\pi^{*}_{\mathrm{MaxEnt}}$ on a soft MDP with a DAG structure. The labels on the transitions of the MDP correspond to the reward function, which satisfies \ref{['eq:reward-function-soft-mdp']} (sparse reward setting). The terminating state distribution $\pi^{*}(x)$ is computed by marginalizing $\pi^{*}(\tau)$ over the trajectories leading to $x$ (e.g., the two trajectories $s_{0} \rightarrow s_{1} \rightarrow x_{4}$ and $s_{0} \rightarrow s_{2} \rightarrow x_{4}$ leading to $x_{4}$). $\pi^{*}(\tau)$ is computed based on \ref{['eq:distribution-trajectories']}, and we assume $\alpha = 1$. The terminating state distribution $\pi^{*}(x)$ should be contrasted with the (target) Gibbs distribution $P(x) \propto \exp(-{\mathcal{E}}(x))$. The normalization constant is $Z' = \exp(-{\mathcal{E}}(x_{3})) + 2\exp(-{\mathcal{E}}(x_{4})) + \exp(-{\mathcal{E}}(x_{5}))$. This MDP is inspired by jain2023gfnscientific.
  • Figure 2: Equivalence between objectives in MaxEnt RL, with corrected rewards, and the objectives in GFlowNets. The objectives are classified based on whether they operate at the level of (complete) trajectories (left), transitions (middle), or if all the states are terminating (right). Further details about the form of the different residuals and the correspondences to transfer from one objective to another are available in \ref{['fig:residual-equivalence']}.
  • Figure 3: Comparison of MaxEnt RL and GFlowNet algorithms on the factor graph inference task, in terms of the Jensen-Shannon divergence between the terminating state distribution and the target distribution during training. Each curve represents the average JSD with 95% confidence interval over 20 random seeds.
  • Figure 4: Comparison of MaxEnt RL and GFlowNet algorithms on the Bayesian structure learning task, in terms of the Jensen-Shannon divergence between the terminating state distribution and the target posterior during training. The two experiments differ in how the marginal likelihood $P({\mathcal{D}}\mid G)$ is computed: (left) using the BGe score geiger1994bge, (right) using the linear Gaussian score nishikawa2022vbg. Each curve represents the average JSD with 95% confidence interval over 20 random seeds.
  • Figure 5: Comparison of MaxEnt RL and GFlowNet algorithms on the phylogenetic tree generation task. (Left) Comparison of the performance in terms of the Pearson correlation coefficient between the terminating state log-probability and the return on 1000 randomly sampled trees. (Center) Correlation between the terminating state log-probability found with DB and the return, each point representing a tree, with a best linear fit line and its slope. (Right) Similarly for SQL. The correlation plots for all methods and all datasets are available in \ref{['app:details-phylogfn']}.
  • ...and 3 more figures
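
The bias described in the Figure 1 caption, and the Jensen-Shannon divergence used as a metric in Figures 3 and 4, can be reproduced on a toy example. The sketch below is illustrative only: the DAG is a guess consistent with the caption (only the two paths into $x_{4}$ are stated there), the energies are made up, terminating states are treated as leaves, and the correction uses a uniform $P_{B}$ over parents, which is just one possible choice of backward probability.

```python
import math
from collections import defaultdict

# Hypothetical DAG consistent with the Figure 1 caption: x4 has two parents.
CHILDREN = {"s0": ["s1", "s2"], "s1": ["x3", "x4"], "s2": ["x4", "x5"]}
ENERGY = {"x3": 1.0, "x4": 0.5, "x5": 2.0}  # made-up energies for illustration
ALPHA = 1.0  # temperature, as assumed in the caption


def trajectories(state="s0", prefix=()):
    """Enumerate all paths from s0 to a leaf (terminating) state."""
    path = prefix + (state,)
    if state not in CHILDREN:
        yield path
    else:
        for child in CHILDREN[state]:
            yield from trajectories(child, path)


def num_parents(state):
    return sum(state in kids for kids in CHILDREN.values())


def terminating_distribution(corrected):
    """Marginal over leaves when pi*(tau) is proportional to exp(return / alpha)."""
    weights = defaultdict(float)
    for path in trajectories():
        x = path[-1]
        ret = -ENERGY[x]  # sparse reward: only the final energy contributes
        if corrected:
            # Add alpha * sum_t log P_B(s_t | s_{t+1}) with uniform P_B (assumption).
            ret += ALPHA * sum(-math.log(num_parents(s)) for s in path[1:])
        weights[x] += math.exp(ret / ALPHA)
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def gibbs_distribution():
    """Target distribution P(x) proportional to exp(-E(x) / alpha)."""
    weights = {x: math.exp(-e / ALPHA) for x, e in ENERGY.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two dicts on the same support."""
    def kl(a, b):
        return sum(a[x] * math.log(a[x] / b[x]) for x in a if a[x] > 0)
    m = {x: 0.5 * (p[x] + q[x]) for x in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


target = gibbs_distribution()
biased = terminating_distribution(corrected=False)
corrected = terminating_distribution(corrected=True)
print("JSD(uncorrected, target):", jsd(biased, target))
print("JSD(corrected, target):  ", jsd(corrected, target))
```

Without the correction, $x_{4}$ is counted once per incoming trajectory, which is where the factor 2 in $Z'$ comes from. With the correction, the two paths into $x_{4}$ each carry a factor $1/2$, so the marginal agrees with the Gibbs target.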

Theorems & Definitions (15)

  • Theorem 3.1: Gen. of tiapkin2023gfnmaxentrl; Theorem 1
  • Proposition 3.1
  • Proposition 3.1
  • Theorem A.1: Gen. of tiapkin2023gfnmaxentrl; Theorem 1
  • proof
  • Proposition B.0
  • proof
  • Corollary B.1
  • proof
  • Proposition B.2
  • ...and 5 more