Flow Q-Learning

Seohong Park, Qiyang Li, Sergey Levine

TL;DR

Flow Q-Learning (FQL) is an offline RL method that uses an expressive flow-matching policy to model complex action distributions while sidestepping the instability of training iterative flow policies with RL. It trains a separate one-step policy to maximize Q-values, regularized by distillation from a BC-trained flow policy, which avoids backpropagation through time during training and iterative action generation at inference. Empirically, FQL performs strongly across 73 state- and pixel-based tasks and remains effective for offline-to-online fine-tuning, with a simple implementation and robust hyperparameter behavior. The approach highlights the value of combining one-step policy guidance with flow-based modeling for scalable, high-capacity offline RL.

Abstract

We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL. Project page: https://seohong.me/projects/fql/
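
The objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scalar actions, a toy velocity field standing in for the BC flow policy, and a hypothetical weight `alpha` that balances value maximization against distillation. The names `flow_action` and `one_step_loss` are illustrative only.

```python
def flow_action(velocity, s, z, steps=10):
    """Generate an action from the BC flow policy by Euler integration.

    Starts from noise z at t=0 and follows da/dt = velocity(t, s, a) up to t=1.
    This iterative loop is what the one-step policy is meant to replace.
    """
    a = z
    dt = 1.0 / steps
    for i in range(steps):
        a = a + dt * velocity(i * dt, s, a)
    return a


def one_step_loss(one_step, q_fn, velocity, s, z, alpha=1.0, steps=10):
    """Loss for the one-step policy on a single (state, noise) pair.

    The first term pushes the one-step action toward high Q-values; the second
    keeps it close to the action the flow policy produces from the same noise.
    The flow action is treated as a fixed target, so no gradient would need to
    pass through the integration loop (alpha is an assumed hyperparameter).
    """
    a_pi = one_step(s, z)                        # single forward pass
    a_flow = flow_action(velocity, s, z, steps)  # distillation target
    distill = (a_pi - a_flow) ** 2
    return -q_fn(s, a_pi) + alpha * distill


# Toy stand-ins (assumptions for illustration only).
toy_velocity = lambda t, s, a: 2.0        # constant flow: maps z to z + 2
toy_q = lambda s, a: -(a - 3.0) ** 2      # Q is highest at a = 3
matching_policy = lambda s, z: z + 2.0    # reproduces the flow policy exactly
greedy_policy = lambda s, z: 3.0          # ignores the data, maximizes Q

loss_matching = one_step_loss(matching_policy, toy_q, toy_velocity, 0.0, 0.5)
loss_greedy = one_step_loss(greedy_policy, toy_q, toy_velocity, 0.0, 0.5, alpha=2.0)
```

In this toy setting, the policy that copies the flow policy pays only the value term, while the greedy policy pays only the distillation term; a larger `alpha` shifts the balance toward staying close to the behavioral data.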

Paper Structure

This paper contains 16 sections, 10 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Flow Q-learning. Flow-matching policies can model complex action distributions, but training an iterative flow policy with RL is challenging. To address this, we train an expressive one-step policy $\mu_\omega(s, z): \mathcal{S} \times \mathbb{R}^d \to \mathcal{A}$ to maximize Q values, while regularizing it with distillation from a BC flow policy.
  • Figure 2: The idea. Offline RL is essentially a tug-of-war between behavioral regularization and value maximization. (a) Naïvely doing this with a flow policy involves costly and unstable backpropagation through time (BPTT). (b) We resolve this by training a separate one-step policy, which maximizes values without BPTT while being regularized by a distillation loss from a BC flow policy.
  • Figure 3: One-step policy. The one-step policy $\mu_\omega$ learns the direct mapping from $z$ to $a$ of the flow policy $\mu_\theta$, while simultaneously maximizing values (this part is omitted in the figure).
  • Figure 4: OGBench tasks.
  • Figure 5: Policy extraction is important. The bars above compare the performances of different policy extraction methods averaged over the 50 state-based OGBench tasks in the offline RL results table.
  • ...and 7 more figures