floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Bhavya Agrawalla, Michal Nauman, Khush Agrawal, Aviral Kumar

TL;DR

floq proposes training value-based critics by parameterizing the Q-function as a time-dependent velocity field and supervising its iterative flow via flow-matching with TD bootstrapping. By controlling integration steps and leveraging dense intermediate supervision, floq enhances Q-capacity scaling beyond monolithic networks and ensembles, achieving up to ~1.8× gains on offline RL benchmarks and online fine-tuning tasks. Key design choices, including HL-Gauss encoding for interpolants and Fourier time embeddings, enable stable training and effective use of iterative computation. This work introduces a new compute-scaling axis for value-based RL through sequential, test-time flow integration and suggests broad avenues for theory and practical future directions.

Abstract

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
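
The mechanism described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a scalar flow, plain Euler integration, a linear interpolant z(t) = (1 - t) z(0) + t y, and initial noise z(0) drawn uniformly from an interval [l, u]. The names `integrate_q`, `flow_matching_td_loss`, and `make_toy_velocity` are hypothetical, and the closed-form toy velocity field stands in for a learned network that would also be conditioned on the state and action.

```python
import random


def integrate_q(velocity, z0, num_steps):
    """Euler-integrate dz/dt = velocity(t, z) from t=0 to t=1.

    The endpoint z(1) is read out as the Q-value estimate. More steps
    means more sequential calls to the velocity field.
    """
    z = z0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(t, z)
    return z


def flow_matching_td_loss(velocity, target_velocity, reward, gamma,
                          l, u, num_steps, rng):
    """One-sample flow-matching loss with a bootstrapped TD target.

    Assumptions (not taken from the source): linear interpolant,
    uniform initial noise on [l, u], squared-error regression.
    """
    # Bootstrapped target: integrate the *target* velocity field at the
    # next state-action pair (conditioning is omitted in this toy).
    z0_next = rng.uniform(l, u)
    y = reward + gamma * integrate_q(target_velocity, z0_next, num_steps)

    # Dense supervision: pick a random time along the flow and regress
    # the velocity onto the straight-line direction from noise to target.
    z0 = rng.uniform(l, u)
    t = rng.random()
    z_t = (1.0 - t) * z0 + t * y
    loss = (velocity(t, z_t) - (y - z0)) ** 2
    return y, loss


def make_toy_velocity(q_value):
    """Hypothetical closed-form field that transports any z towards q_value."""
    return lambda t, z: (q_value - z) / (1.0 - t)


if __name__ == "__main__":
    rng = random.Random(0)
    target_net = make_toy_velocity(2.0)   # pretend Q_target(s', a') = 2.0
    online_net = make_toy_velocity(2.0)   # pretend the online field already fits
    print(flow_matching_td_loss(online_net, target_net, reward=1.0,
                                gamma=0.5, l=-1.0, u=1.0, num_steps=8, rng=rng))
```

In this sketch `num_steps` is the knob the abstract refers to: the same velocity field is queried more times to spend more sequential compute on a single Q-value estimate.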

Paper Structure

This paper contains 17 sections, 5 equations, 16 figures, 10 tables, 1 algorithm.

Figures (16)

  • Figure 1: floq architecture. Our approach models the Q-function via a velocity field in a flow-matching generative model. Over multiple calls, this velocity field converts a randomly sampled input $\boldsymbol{z}(0)$ into a sample from the Dirac-delta distribution centered at the Q-value. We prescribe how this sample can be trained via a flow-matching loss. Doing this enables us to scale computation by running numerical integration, with multiple calls to the velocity field. To train floq, we utilize a categorical representation of the input $\boldsymbol{z}_t$ (Farebrother et al., 2024).
  • Figure 2: Illustrating the role of our design choices. Left: When the width of the interval $[l, u]$ is small, and the overlap between this interval and the range of target Q-values we hope to see is minimal, we would expect to see straighter flow traversals that might be independent of the interpolant $\boldsymbol{z}$. With wider intervals $[l, u]$, however, the flow traversal would depend on $\boldsymbol{z}$, and hence span a curved path when running numerical integration during inference. Right: Illustrating how we transform an input interpolant $\boldsymbol{z}$ into a categorical representation (top) and convert time $t$ into a Fourier-basis embedding (bottom).
  • Figure 3: OGBench (Park et al., 2025) domains that we study in this work. These tasks include high-dimensional state and action spaces, sparse rewards, stochasticity, as well as hierarchical structure.
  • Figure 4: Comparison of floq against the baseline FQL across median, interquartile mean (IQM), mean, and optimality gap, following rliable (Agarwal et al., 2021). Results show that floq consistently outperforms FQL across all evaluation criteria, with no confidence interval overlap in any case, meaning that the gains from floq are significant.
  • Figure 5: Learning curves for online fine-tuning of floq and FQL across four hard tasks. floq not only provides a stronger initialization from offline RL training but also maintains its advantage throughout online fine-tuning on the hardest tasks, leading to faster adaptation and higher final success rates. The shaded gray area denotes offline RL training.
  • ...and 11 more figures
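
The two input transformations named in the Figure 2 caption (a categorical representation of the interpolant and a Fourier-basis embedding of time) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the HL-Gauss encoding is written in its common form (Gaussian mass per histogram bin, renormalized over the support), and the bin count, `sigma`, the integer frequencies, and the function names `hl_gauss_encode` and `fourier_time_embedding` are all hypothetical choices.

```python
import math


def hl_gauss_encode(z, lo, hi, num_bins, sigma):
    """Encode scalar z as a categorical vector over num_bins bins on [lo, hi].

    Each entry is the probability mass that a Gaussian N(z, sigma^2)
    places in that bin, renormalized over the support. Assumes z is not
    so far outside [lo, hi] that the total mass underflows to zero.
    """
    edges = [lo + (hi - lo) * i / num_bins for i in range(num_bins + 1)]
    # Gaussian CDF evaluated at every bin edge.
    cdf = [0.5 * (1.0 + math.erf((e - z) / (sigma * math.sqrt(2.0))))
           for e in edges]
    total = cdf[-1] - cdf[0]
    return [(cdf[i + 1] - cdf[i]) / total for i in range(num_bins)]


def fourier_time_embedding(t, num_freqs):
    """Embed t in [0, 1] as [cos(2*pi*k*t), sin(2*pi*k*t)] for k = 1..num_freqs."""
    feats = []
    for k in range(1, num_freqs + 1):
        angle = 2.0 * math.pi * k * t
        feats.append(math.cos(angle))
        feats.append(math.sin(angle))
    return feats


if __name__ == "__main__":
    probs = hl_gauss_encode(z=0.3, lo=-1.0, hi=1.0, num_bins=10, sigma=0.15)
    emb = fourier_time_embedding(t=0.25, num_freqs=4)
    # A velocity network would consume these features (plus state and
    # action features) in place of the raw scalars z and t.
    print(probs)
    print(emb)
```

In a setup like this, the velocity field receives the concatenation of `probs` and `emb` rather than the raw scalars, which is one plausible reading of how the caption's "categorical representation" and "Fourier-basis embedding" enter the network.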