Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier

TL;DR

The paper tackles offline reinforcement learning by addressing extrapolation error through Guided Flow Policy (GFP), which combines a value-aware flow policy (VaBC) with a distilled one-step actor in a bidirectional behavior-regularized actor-critic (BRAC) framework. GFP uses a temperature-controlled guiding function $g_\eta$ to selectively clone high-value dataset actions while keeping the policy within the dataset support, and it avoids backpropagation through time. The approach yields state-of-the-art results across 144 tasks from OGBench, Minari, and D4RL, with notable gains on suboptimal datasets and hard tasks. The authors also analyze the temperature parameter and show the importance of hyperparameter tuning by re-evaluating prior baselines on OGBench, highlighting practical considerations for robust offline RL benchmarking.

Abstract

Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
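
The weighted behavior cloning described above can be illustrated with a minimal sketch. The exact form of the guidance term is not given in this summary, so the function `guidance_weight` below is a hypothetical stand-in: it assumes a weight in $(0, 1)$ obtained by comparing the critic value of a dataset action with the critic value of the actor's action, scaled by a temperature `eta`. The names `q_data`, `q_actor`, and `weighted_bc_loss` are likewise assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def guidance_weight(q_data, q_actor, eta):
    """Hypothetical guidance weight in (0, 1).

    ASSUMPTION: a logistic comparison between the critic value of the
    dataset action (q_data) and that of the actor's action (q_actor),
    sharpened by the temperature eta. This is an illustrative form only.
    """
    z = (np.asarray(q_data, dtype=float) - np.asarray(q_actor, dtype=float)) / eta
    # sigmoid(z) written with tanh so that large |z| does not overflow
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def weighted_bc_loss(weights, per_sample_bc_loss):
    """Average of per-sample cloning losses, each scaled by its weight."""
    weights = np.asarray(weights, dtype=float)
    per_sample_bc_loss = np.asarray(per_sample_bc_loss, dtype=float)
    return float(np.mean(weights * per_sample_bc_loss))


# Three dataset actions whose critic values are above, equal to, and below
# the critic value of the actor's action in the same state.
q_data = np.array([1.0, 0.0, -1.0])
q_actor = 0.0

w_high_temp = guidance_weight(q_data, q_actor, eta=100.0)  # weak filtering
w_low_temp = guidance_weight(q_data, q_actor, eta=0.01)    # sharp filtering
```

Under this assumed form, a high temperature gives every dataset action a weight close to 0.5, so cloning is nearly indiscriminate, while a low temperature pushes the weights toward 1 for higher-value actions and toward 0 for lower-value ones. Because the weights only rescale losses on dataset pairs, the cloned policy is still trained exclusively on in-distribution state-action pairs.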

Paper Structure

This paper contains 17 sections, 12 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the Guided Flow Policy framework. GFP consists of three main components: (i) in yellow, VaBC, a multi-step flow policy $\pi_\omega$ trained via weighted BC using the guidance term $g_\eta$, (ii) in green, a one-step actor $\pi_\theta$ distilled from the flow policy, and (iii) in gray, a critic $Q_\phi$ guiding action evaluation. $\pi_\omega$ regularizes the actor toward high-value actions from the dataset; in turn, the actor shapes the flow and optimizes the critic following the actor-critic approach. The different components of the figure are introduced throughout the paper. Each drawing represents the probability distribution of actions $a \in \mathcal{A}$ of a policy, in a current state $s$, except for the gray ones, where it is the value of actions $a \in \mathcal{A}$ in state $s$, according to the critic.
  • Figure 2: Comparison of behavior cloning under different levels of guidance. Left: Prior work (e.g., FQL, park2025flow) uses no filtering, indiscriminately imitating all state-action pairs. Right: In contrast, our method introduces a temperature-controlled guidance mechanism, as shown in Eq. \ref{eq:guidance-term}, resulting in VaBC. At high temperatures, the guidance is weak, so the actor is influenced by many candidate actions. At moderate temperatures, the filtering becomes sharper, giving more weight to higher-value actions while still keeping enough regularization and exploration. At low temperatures, the filtering is very selective, concentrating almost exclusively on the highest-value actions according to the critic. However, excessive concentration at very low temperatures may allow the actor to escape the dataset's action distribution, as shown on the right in green, leading to critic overestimation and out-of-distribution issues. Importantly, VaBC cannot escape the dataset's action distribution even at very low temperatures, since it trains exclusively on in-distribution state-action pairs. The dashed blue contours in the final yellow drawings (first row) illustrate this constraint.
  • Figure 3: OGBench analysis. (a) Performance profiles for 50 tasks comparing GFP against a wide range of prior works, showing the fraction of tasks where each algorithm achieves a score above threshold $\tau$, using the evaluation reported by park2025flow. (b) Performance profiles on 105 tasks, including more challenging ones, and carefully reevaluated prior methods. (c) Performance profiles restricted to 30 noisy and explore tasks.
  • Figure 4: Temperature analysis on challenging OGBench Puzzle (left) and Cube (right) tasks with suboptimal data. Plots (a) and (c): performance scores across temperature values $\eta$ for our GFP method (Actor $\pi_\theta$ and VaBC $\pi_\omega$) compared to baselines (FQL, ReBRAC) on puzzle-4x4-noisy-task3 and cube-double-noisy-task2. Plots (b) and (d): probability that the guidance term $g_\eta$ is above different thresholds $\delta$ as a function of temperature, illustrating how temperature controls the sharpness of value-guided filtering.
  • Figure 5: Overview of some offline-RL frameworks. The symbol $\star$ indicates the use of a diffusion or a flow model, and specifically in which component. Each box provides the name of the component (e.g., critic) and the working principle that is used to train it (e.g., TD learning). The arrows indicate how the components depend on each other, while the dashed arrow is optional (see Eq. \ref{eq:vabc_target}).
  • ...and 3 more figures
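
The performance profiles in Figure 3 report, for each threshold $\tau$, the fraction of tasks on which an algorithm scores above $\tau$. A minimal sketch of that computation follows; the function name `performance_profile`, the example scores, and the use of a strict inequality are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np


def performance_profile(scores, thresholds):
    """Fraction of tasks whose score exceeds each threshold tau.

    scores:     one score per task for a single algorithm
    thresholds: the tau values at which the profile is evaluated
    ASSUMPTION: "above" is read as a strict inequality (score > tau).
    """
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Compare every threshold with every task score, giving an array of
    # shape (n_thresholds, n_tasks), then average over the task axis.
    return (scores[None, :] > thresholds[:, None]).mean(axis=1)


# Hypothetical per-task scores for one algorithm on four tasks.
example_scores = [10.0, 50.0, 90.0, 100.0]
profile = performance_profile(example_scores, thresholds=[0.0, 50.0, 95.0])
```

Plotting such a profile against $\tau$ for several algorithms gives curves like those in Figure 3: a curve that lies higher at a given threshold means that algorithm clears that score on a larger share of tasks.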