Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
TL;DR
The paper addresses extrapolation error in offline reinforcement learning with Guided Flow Policy (GFP), which combines a flow policy trained by value-aware behavior cloning (VaBC) with a distilled one-step actor in a bidirectional behavior-regularized actor-critic (BRAC) framework. GFP uses a temperature-controlled guiding function $g_\eta$ to selectively clone high-value dataset actions while keeping the policy within the dataset support; because the actor is a one-step model, training avoids backpropagation through time. The approach yields state-of-the-art results across 144 tasks from OGBench, Minari, and D4RL, with notable gains on suboptimal datasets and hard tasks. The authors also analyze the temperature parameter and, by re-evaluating prior baselines on OGBench, show how strongly hyperparameter tuning affects reported results, a practical consideration for robust offline RL benchmarking.
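To make the role of the temperature concrete, here is a minimal sketch of a temperature-controlled guiding weight. The functional form is an assumption made for illustration and is not taken from the paper: it is a sigmoid of the gap between the critic value of the dataset action and the critic value of the actor's action, divided by $\eta$. The names `guide_weight`, `q_data`, and `q_actor` are hypothetical.

```python
import numpy as np

def guide_weight(q_data, q_actor, eta):
    """Hypothetical guiding weight in (0, 1).

    ASSUMPTION: a sigmoid of the value gap scaled by the temperature eta.
    This is an illustrative form, not the paper's definition of g_eta.
    """
    gap = (np.asarray(q_data, dtype=float) - np.asarray(q_actor, dtype=float)) / eta
    # sigmoid(x) written as 0.5 * (1 + tanh(x / 2)), which does not overflow
    return 0.5 * (1.0 + np.tanh(0.5 * gap))

# Dataset actions valued above, equal to, and below the actor's action.
weights = guide_weight([1.0, 0.0, -1.0], [0.0, 0.0, 0.0], eta=1.0)
```

Under this assumed form, a small $\eta$ pushes the weights toward 0 or 1, so only dataset actions that beat the actor are cloned, while a large $\eta$ pushes every weight toward 0.5 and recovers near-uniform behavior cloning.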
Abstract
Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state-based and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
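The weighted behavior cloning step described above can be sketched as a per-sample weighted flow-matching loss. This is a rough illustration under stated assumptions rather than the authors' implementation: it assumes a linear interpolation path between Gaussian noise and the dataset action, and it takes the per-sample weights as given. The names `weighted_flow_bc_loss` and `velocity_fn` are hypothetical.

```python
import numpy as np

def weighted_flow_bc_loss(velocity_fn, states, actions, weights, rng):
    """Weighted conditional flow-matching loss (illustrative sketch).

    ASSUMPTIONS: linear path x_t = (1 - t) * noise + t * action, whose
    target velocity is (action - noise); `weights` holds one non-negative
    weight per sample, e.g. larger for higher-value dataset actions.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))   # one time per sample
    x_t = (1.0 - t) * noise + t * actions         # point on the path
    target = actions - noise                      # velocity of the linear path
    pred = velocity_fn(states, x_t, t)
    per_sample = np.sum((pred - target) ** 2, axis=-1)
    return float(np.mean(weights * per_sample))

# Usage with a placeholder velocity model that always predicts zero.
states = np.zeros((4, 3))
actions = np.ones((4, 2))
zero_velocity = lambda s, x, t: np.zeros_like(x)
loss = weighted_flow_bc_loss(
    zero_velocity, states, actions, np.ones(4), np.random.default_rng(0)
)
```

Setting every weight to 1 gives plain flow-matching behavior cloning; shrinking the weights of low-value transitions toward 0 removes their contribution to the gradient, which is the sense in which the cloning becomes selective.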
