RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang

TL;DR

RLFR addresses the limitations of Reinforcement Learning with Verifiable Rewards by introducing flow rewards derived from LLM latent space. It builds flow fields from off-policy high-quality data and online rejection sampling, and quantifies velocity deviations of policy latents to shape per-token rewards, linking velocity signals to likelihood via a score-based interpretation. Empirical results across language and multimodal reasoning benchmarks show that flow rewards improve performance over binary RLVR and logit-space shaping, while ablations highlight the importance of offline initialization, online updates, and robust reward filtering. The approach demonstrates latent representations as a productive substrate for reward design, offering a scalable paradigm for reward shaping beyond traditional outcome-based feedback with meaningful implications for reasoning-intensive tasks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where the flow fields of model latents are constructed from off-policy high-quality data and on-policy rejection sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space is much underexplored. Moreover, RLFR is able to compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotation, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
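
The idea of turning velocity deviations into per-token rewards can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear (rectified-flow style) interpolation path, a negative mean squared deviation as the reward, and a closed-form toy velocity field (`point_mass_velocity`) standing in for a flow network trained on hidden states of reference data. All function names and parameters here are hypothetical.

```python
import random


def point_mass_velocity(mu):
    """Toy velocity field for a reference distribution concentrated at mu.

    Assumption of this sketch: under the linear path
    x_t = (1 - t) * noise + t * x_1 with x_1 = mu, the exact velocity at
    (x_t, t) is (mu - x_t) / (1 - t). A real system would use a learned network.
    """
    def velocity_fn(x_t, t):
        return [(m - x) / (1.0 - t) for m, x in zip(mu, x_t)]
    return velocity_fn


def velocity_deviation(latent, velocity_fn, t, noise):
    """Mean squared error between predicted and target velocity for one latent."""
    # Noised point on the assumed linear path between noise and the latent.
    x_t = [(1.0 - t) * n + t * h for n, h in zip(noise, latent)]
    # Target velocity of that path: the straight line from noise to latent.
    target = [h - n for n, h in zip(noise, latent)]
    pred = velocity_fn(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(latent)


def flow_rewards(latents, velocity_fn, t=0.5, n_noise=8, seed=0):
    """Per-token reward as the negative deviation, averaged over noise draws.

    The sign and scaling are assumptions; the source only states that
    velocity deviations are quantified to serve as a reward signal.
    """
    rng = random.Random(seed)
    rewards = []
    for h in latents:
        devs = []
        for _ in range(n_noise):
            noise = [rng.gauss(0.0, 1.0) for _ in h]
            devs.append(velocity_deviation(h, velocity_fn, t, noise))
        rewards.append(-sum(devs) / n_noise)
    return rewards


if __name__ == "__main__":
    field = point_mass_velocity([1.0, -1.0])
    # First latent sits on the reference point, second one is displaced.
    print(flow_rewards([[1.0, -1.0], [3.0, -1.0]], field))
```

In this toy setting, a latent that coincides with the reference point has essentially zero deviation, while a displaced latent receives a lower reward. The timestep `t` and the number of noise draws `n_noise` are free parameters of the sketch.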

Paper Structure

This paper contains 23 sections, 32 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 2: Policies optimized with RLVR are prone to overlooking potentially valuable explorations in reasoning trajectories. To go beyond binary verification, auxiliary signals are used for reward shaping of process tokens, such as token entropy and likelihood collected from logit space, where the risks of self-policy rewarding are non-negligible. Alternatively, we show that the latent space is much underexplored yet highly expressive, and that a well-established flow field can be a sound environment for yielding flow rewards from velocity deviations and extending RLVR with latent reward utilization.
  • Figure 3: Distribution of trajectory tokens in LLM reasoning. (a) Distribution of trajectory tokens in latent space (top) and logit space (bottom). We perform 256 rollouts for a prompt randomly sampled from MATH (hendrycks2021measuring). The latent distribution shows progressively more expressive signals on tail trajectory tokens, as they continuously interact with preceding tokens for context compression. In contrast, neither the logit distribution nor the textual clouds of reasoning trajectories in (b) and (c) reveal any distinguishable signals, highlighting the potential of latent space for reward utilization.
  • Figure 3: Case study of rewarded tokens during training. "+" denotes a positive flow reward and "-" denotes a negative flow reward.
  • Figure 4: Textual cloud of the offline start dataset.
  • Figure 5: Results at different timesteps for the flow reward and the debiasing effect.
  • ...and 3 more figures

Theorems & Definitions (2)

  • proof
  • proof