MG2FlowNet: Accelerating High-Reward Sample Generation via Enhanced MCTS and Greediness Control

Rui Zhu, Xuan Yu, Yudong Zhang, Chen Zhang, Xu Wang, Yang Wang

TL;DR

MG2FlowNet addresses the problem of generating high-reward samples efficiently in GFlowNets while keeping diverse coverage. It integrates an enhanced Monte Carlo Tree Search (MCTS), guided by Polynomial Upper Confidence Trees (PUCT), with a tunable $\alpha$-greedy mechanism. The search uses PUCT-guided selection, expands all legal actions, runs simulations that use forward transition probabilities to evaluate trajectories, and backpropagates rewards along promising paths under a constrained credit assignment rule. A key contribution is the controllable Greediness Coefficient $\alpha$, which blends the forward policy $P_F$ with a learned $Q$-value distribution to adjust the balance between exploration and exploitation throughout training. Empirical results on Hypergrid and Molecule Design tasks show faster discovery of high-reward regions and sustained high-reward sampling without sacrificing diversity, which supports the method’s practicality for large, sparse-reward domains. The code is released for reproducibility.
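To make the role of $\alpha$ concrete, here is a minimal Python sketch of one way such a blend could look. The summary only says that $\alpha$ blends $P_F$ with a $Q$-value distribution; the convex mixture, the softmax used to turn $Q$-values into a distribution, and the names `softmax` and `blend_policy` are all assumptions made for illustration, not the paper's exact rule.

```python
import math

def softmax(values, temperature=1.0):
    """Turn arbitrary scores into a probability distribution (assumed choice)."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def blend_policy(p_forward, q_values, alpha):
    """Assumed convex mixture: (1 - alpha) * P_F + alpha * softmax(Q).

    alpha = 0 recovers the forward policy; alpha = 1 follows the Q-values only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    q_dist = softmax(q_values)
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(p_forward, q_dist)]

# Hypothetical numbers for a state with three legal actions.
p_f = [0.5, 0.3, 0.2]   # forward policy P_F(. | s)
q = [0.0, 2.0, 1.0]     # Q(s, a) estimates, e.g. from tree search
mixed = blend_policy(p_f, q, alpha=0.5)
print(mixed)
```

Under this assumed form, raising `alpha` shifts probability mass toward actions with high $Q$-values, while any `alpha` below 1 keeps every action that $P_F$ supports reachable.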

Abstract

Generative Flow Networks (GFlowNets) have emerged as a powerful tool for generating diverse and high-reward structured objects by learning to sample from a distribution proportional to a given reward function. Unlike conventional reinforcement learning (RL) approaches that prioritize optimization of a single trajectory, GFlowNets seek to balance diversity and reward by modeling the entire trajectory distribution. This capability makes them especially suitable for domains such as molecular design and combinatorial optimization. However, existing GFlowNet sampling strategies tend to overexplore and struggle to consistently generate high-reward samples, particularly in large search spaces with sparse high-reward regions. Improving the probability of generating high-reward samples without sacrificing diversity therefore remains a key challenge. In this work, we integrate an enhanced Monte Carlo Tree Search (MCTS) into the GFlowNet sampling process, using MCTS-based policy evaluation to guide generation toward high-reward trajectories and Polynomial Upper Confidence Trees (PUCT) to balance exploration and exploitation adaptively, and we introduce a controllable mechanism to regulate the degree of greediness. Our method enhances exploitation without sacrificing diversity by dynamically balancing exploration and reward-driven guidance. The experimental results show that our method not only accelerates the discovery of high-reward regions but also continues to generate high-reward samples, while preserving the diversity of the generative distribution. All implementations are available at https://github.com/ZRNB/MG2FlowNet.
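As a rough illustration of how PUCT trades off exploration against exploitation, the sketch below scores actions with the widely used AlphaZero-style rule $Q(s,a) + c \cdot P(s,a) \cdot \sqrt{\sum_b N(s,b)} / (1 + N(s,a))$. This particular formula, the constant `c_puct`, and the function names are assumptions; the abstract does not state the exact variant used in MG2FlowNet.

```python
import math

def puct_scores(q_values, priors, visit_counts, c_puct=1.0):
    """AlphaZero-style PUCT score per action (assumed form, see lead-in).

    q_values:     current value estimates Q(s, a)
    priors:       prior probabilities P(s, a), e.g. the forward policy P_F
    visit_counts: visit counts N(s, a) accumulated by the search
    """
    total_visits = sum(visit_counts)
    return [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]

def select_action(q_values, priors, visit_counts, c_puct=1.0):
    """Return the index of the action with the highest PUCT score."""
    scores = puct_scores(q_values, priors, visit_counts, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical statistics for a node with three children.
q = [0.6, 0.5, 0.0]
prior = [0.4, 0.4, 0.2]
visits = [10, 1, 0]

print(select_action(q, prior, visits, c_puct=1.0))  # 1: rarely visited, high prior
print(select_action(q, prior, visits, c_puct=0.0))  # 0: pure exploitation of Q
```

With the exploration bonus switched off the heavily visited action with the best $Q$ wins, whereas a positive `c_puct` favors the under-visited action whose prior is still high.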

Paper Structure

This paper contains 53 sections, 20 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Strategy of MG2FlowNet. MG2FlowNet prioritizes high-reward states in explored regions while still allocating effort to unexplored areas, ensuring that potential high-reward states are not overlooked.
  • Figure 2: Illustration of framework. The left panel shows trajectory sampling in GFlowNets, where each action is chosen based on the updated $Q(s, a)$ and $P_F$ after $I$ rounds of MCTS iterations. The right panel illustrates the MCTS procedure, including selection, expansion, simulation, and backpropagation, with the fourth iteration shown as an example for clarity.
  • Figure 3: High-Reward Mode Discovery and Distribution Matching Error on Hypergrid. Left: Comparison of the number of high-reward modes that different models find with the same number of visits. Right: Comparison of $\ell_1$ loss across models, measuring the deviation between the learned sampling distribution and the target reward distribution.
  • Figure 4: Number of modes with reward $>7.5$ and $>8.0$ in the molecule design task. Left: Comparison of different models in terms of the number of modes with reward greater than 7.5. Right: Comparison of different models in terms of the number of modes with reward greater than 8.0.
  • Figure 5: The Tanimoto similarity among the top-1000 molecules with the highest rewards generated by different models.
  • ...and 1 more figure
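Two of the metrics named in the captions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's evaluation code: the $\ell_1$ error is taken here as the mean absolute gap between empirical sample frequencies and $R(x)/Z$ over all states, and Tanimoto similarity is computed on fingerprints represented as sets of on-bit indices. The function names and toy data are hypothetical.

```python
from collections import Counter

def l1_error(rewards, samples):
    """Mean absolute gap between empirical frequencies and R(x) / Z.

    rewards: dict mapping every terminal state to its reward R(x) > 0
    samples: list of sampled terminal states (all assumed to be keys of rewards)
    """
    z = sum(rewards.values())   # partition function Z
    counts = Counter(samples)   # missing states count as 0
    n = len(samples)
    gaps = (abs(counts[x] / n - r / z) for x, r in rewards.items())
    return sum(gaps) / len(rewards)

def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices.

    Returns 0.0 when both sets are empty, a convention chosen for this sketch.
    """
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

# Toy example: target distribution is a: 0.25, b: 0.75.
rewards = {"a": 1.0, "b": 3.0}
print(l1_error(rewards, ["a", "b", "b", "b"]))  # 0.0, frequencies match R/Z
print(l1_error(rewards, ["a", "a", "b", "b"]))  # 0.25
print(tanimoto({1, 2, 3}, {2, 3, 4}))           # 0.5
```

A lower $\ell_1$ error indicates a sampler closer to the reward-proportional target, and a lower average pairwise Tanimoto similarity among top-scoring molecules indicates a more diverse set.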