FlowRL: Matching Reward Distributions for LLM Reasoning

Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin

TL;DR

FlowRL introduces reward distribution matching for LLM reasoning by normalizing scalar rewards with a learnable partition function $Z_\phi$ and minimizing the reverse KL divergence between the policy and a reward-weighted target distribution. Grounded in GFlowNets via trajectory balance, FlowRL provides a practical squared-loss surrogate and integrates length normalization and importance sampling to handle long chain-of-thought reasoning. Empirically, FlowRL outperforms reward-maximizing baselines (PPO and GRPO) on both math and code benchmarks, with increased solution diversity and robust generalization across model scales. The approach represents a principled shift toward exploration-rich, diverse reasoning trajectories in RL for LLMs, with strong implications for scalable reasoning tasks.

Abstract

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
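
The objective described in the abstract can be written out roughly as follows. This is a sketch in assumed notation, not a transcription of the paper's equations: $x$ is a prompt, $y$ a sampled response, $r(x, y)$ the scalar reward, $Z_\phi(x)$ the learnable partition function, and $\beta$ is assumed to be a reward-scaling coefficient (presumably the $\beta$ ablated in Figure 3). The last line is the generic squared trajectory-balance residual; any length normalization or importance weighting the paper applies is omitted here.

```latex
% Assumed form of the normalized target distribution
\tilde{p}(y \mid x) = \frac{\exp\bigl(\beta\, r(x, y)\bigr)}{Z_\phi(x)}

% Reverse KL between the policy and the target
\min_{\theta}\; D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \tilde{p}(\cdot \mid x)\bigr)
  = \min_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \bigl[\log \pi_\theta(y \mid x) - \beta\, r(x, y) + \log Z_\phi(x)\bigr]

% Generic squared trajectory-balance surrogate (zero exactly when
% Z_\phi(x)\, \pi_\theta(y \mid x) = \exp(\beta\, r(x, y)) for every y)
\mathcal{L}_{\mathrm{TB}}(\theta, \phi)
  = \bigl(\log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y)\bigr)^{2}
```

The residual inside the square is the same quantity whose expectation appears in the reverse KL, which is why driving it to zero on sampled responses also drives the KL to zero.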

Paper Structure

This paper contains 33 sections, 2 theorems, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Top: Comparison between distribution-matching and reward-maximizing approaches. FlowRL (left) learns to match the full reward distribution, maintaining diversity across multiple modes with low KL divergence. In contrast, reward-maximizing methods like GRPO (right) concentrate on a single high-reward peak, leading to mode collapse and higher KL divergence. Bottom: Performance comparison. FlowRL consistently outperforms GRPO across math and code domains.
  • Figure 2: GFlowNets (JMLR:v24:22-0364), a flow-balance perspective on reinforcement learning. The initial flow $Z_\phi(s_0)$ injects probability mass into the environment, which is transported through intermediate states by the policy $\pi_\theta$ and accumulated at terminal states in proportion to the scalar rewards.
  • Figure 3: Ablation study on the $\beta$ in FlowRL. $\beta = 15$ (highlighted in blue) achieves the best performance.
  • Figure 4: GPT-judged diversity scores on rollouts of AIME 24/25 problems. FlowRL generates more diverse solutions than R++, GRPO, and PPO.
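
The contrast drawn in Figure 1 can be illustrated with a toy calculation. The sketch below is not taken from the paper: the three candidate solutions, their rewards, the value of `beta`, and the "collapsed" policy are all made-up numbers, and the target is assumed to be a softmax over scaled rewards. It only shows that a policy concentrated on one high-reward mode has a larger reverse KL to the reward-induced target than a policy that spreads mass across both high-reward modes.

```python
import math

def target_distribution(rewards, beta):
    """Softmax over beta-scaled rewards (assumed form of the target)."""
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)    # plays the role of the partition function
    return [e / z for e in exps]

def reverse_kl(policy, target):
    """KL(policy || target) over a finite set of outcomes."""
    return sum(p * math.log(p / q) for p, q in zip(policy, target) if p > 0)

# Hypothetical rewards: two equally good solutions and one poor one.
rewards = [1.0, 1.0, 0.0]
target = target_distribution(rewards, beta=2.0)

matched = list(target)            # distribution-matching policy
collapsed = [0.98, 0.01, 0.01]    # policy collapsed onto one mode

kl_matched = reverse_kl(matched, target)
kl_collapsed = reverse_kl(collapsed, target)
print(f"matched: {kl_matched:.4f}, collapsed: {kl_collapsed:.4f}")
```

The collapsed policy still earns nearly maximal expected reward in this toy setting, which is why reward maximization alone does not penalize it.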

Theorems & Definitions (6)

  • Proposition 1
  • Remark 2: Trajectory balance as a practical surrogate for KL minimization
  • Remark 3: Reward shaping via length normalization
  • Remark 4: Importance sampling for data-efficient training
  • Proposition 5
  • Remark 6: FlowRL beyond reward maximization
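
Remarks 2 to 4 name three ingredients: a trajectory-balance surrogate, length normalization, and importance sampling. The sketch below shows one way they might combine into a per-response loss. It is illustrative only: the function name, argument names, the clipping range `clip_eps`, the sequence-level importance ratio, and the exact placement of `beta` are assumptions rather than the paper's definitions, and a real implementation would use an autodiff framework and treat the importance weight as a constant during backpropagation.

```python
import math

def flowrl_style_loss(log_z, logp_new, logp_old, reward, length,
                      beta=15.0, clip_eps=0.2):
    """Weighted squared trajectory-balance residual for one response.

    log_z    : log of the learnable partition function for this prompt
    logp_new : summed token log-probs under the current policy
    logp_old : summed token log-probs under the policy that sampled it
    reward   : scalar reward for the response
    length   : number of tokens in the response
    """
    # Length normalization: use the mean per-token log-probability so that
    # long chain-of-thought responses do not dominate the residual.
    norm_logp = logp_new / length

    # Trajectory-balance residual; it is zero when
    # log Z + normalized log-prob equals the scaled reward.
    residual = log_z + norm_logp - beta * reward

    # Importance weight for reusing samples from the older policy,
    # clipped to a PPO-style range (an assumed choice).
    ratio = math.exp(logp_new - logp_old)
    weight = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)

    return weight * residual ** 2

# Hypothetical numbers: an on-policy sample with zero reward.
loss = flowrl_style_loss(log_z=0.0, logp_new=-10.0, logp_old=-10.0,
                         reward=0.0, length=10)
print(loss)
```

Squaring the residual gives a loss that can be minimized on sampled responses without computing the partition function exactly, since `log_z` is simply another learned quantity.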