FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning

Marvin Alles, Nutan Chen, Patrick van der Smagt, Botond Cseke

TL;DR

FlowQ tackles offline RL under distributional shift by learning an energy-guided flow policy via energy-guided flow matching, which models a time-dependent velocity field that steers a diffusion-like path toward high-value actions. The method learns a policy of the form $\pi(a|s) \propto \pi_\beta(a|s) \exp(Q(s,a))$ using a Gaussian-path approximation to a posterior-like distribution, enabling simulation-free training that does not require sampling from the policy during learning and keeps training cost independent of the number of flow steps. Empirically, FlowQ is competitive with state-of-the-art offline RL methods on the D4RL benchmark, with particular strength on antmaze and manipulation tasks, and shows favorable efficiency compared to diffusion-based approaches due to reduced backpropagation through action samples. The approach offers a scalable, practical alternative for offline RL with multimodal action distributions and can be extended via adaptive energy scaling to broaden its applicability.
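The policy form above can be read against the energy-weighted target given in the Figure 1 caption, $\hat{p}_1(x_1) \propto p_1(x_1) \exp \{- \lambda \,\mathcal{E}(x_1) \}$. Taking $x_1 = a$ and $p_1 = \pi_\beta(\cdot|s)$, and assuming (this identification is not spelled out in the summary) that the scaled energy is the negative action value, $\lambda \,\mathcal{E}(a) = -Q(s,a)$, the two expressions coincide:

$$
\pi(a|s) \;\propto\; \pi_\beta(a|s)\, \exp \{- \lambda \,\mathcal{E}(a) \} \;=\; \pi_\beta(a|s)\, \exp(Q(s,a)).
$$

Under this reading, low energy corresponds to high value, so the guided flow concentrates on actions that are both well supported by the behavior policy $\pi_\beta$ and rated highly by $Q$.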

Abstract

The use of guidance to steer sampling toward desired outcomes has been widely explored within diffusion models, especially in applications such as image and trajectory generation. However, incorporating guidance during training remains relatively underexplored. In this work, we introduce energy-guided flow matching, a novel approach that enhances the training of flow models and eliminates the need for guidance at inference time. We learn a conditional velocity field corresponding to the flow policy by approximating an energy-guided probability path as a Gaussian path. Learning guided trajectories is appealing for tasks where the target distribution is defined by a combination of data and an energy function, as in reinforcement learning. Diffusion-based policies have recently attracted attention for their expressive power and ability to capture multi-modal action distributions. Typically, these policies are optimized using weighted objectives or by back-propagating gradients through actions sampled by the policy. As an alternative, we propose FlowQ, an offline reinforcement learning algorithm based on energy-guided flow matching. Our method achieves competitive performance while the policy training time is constant in the number of flow sampling steps.
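To make the phrase "training time is constant in the number of flow sampling steps" concrete, the following is a minimal sketch of a plain conditional flow matching loss on a Gaussian (straight-line) path. It is not the authors' energy-guided objective; the function names `linear_path_sample` and `flow_matching_loss` and the placeholder velocity function are assumptions made for illustration. The point it shows is that each loss term needs only one draw of $(x_0, x_1, t)$ and no ODE integration.

```python
import random


def linear_path_sample(x0, x1, t):
    """Return the point x_t on the straight line from x0 to x1 and its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]  # time derivative of x_t, constant in t
    return x_t, u_t


def flow_matching_loss(velocity_fn, data_batch, rng):
    """Monte Carlo estimate of a conditional flow matching loss.

    velocity_fn(x_t, t) is a hypothetical model returning a velocity vector.
    No flow is simulated here: the cost per data point does not depend on how
    many integration steps would later be used for sampling.
    """
    total = 0.0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # sample from the Gaussian source
        t = rng.random()                        # time drawn uniformly from [0, 1)
        x_t, u_t = linear_path_sample(x0, x1, t)
        v = velocity_fn(x_t, t)
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u_t))
    return total / len(data_batch)


# Usage with a placeholder model that always predicts zero velocity.
rng = random.Random(0)
batch = [[1.0, -1.0], [0.5, 2.0], [-2.0, 0.0]]
zero_velocity = lambda x_t, t: [0.0 for _ in x_t]
loss = flow_matching_loss(zero_velocity, batch, rng)
```

In a real setup the placeholder would be a neural network and the loss would be minimized by gradient descent; an energy-guided variant would change the path or target being regressed, which this sketch does not attempt to reproduce.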

Paper Structure

This paper contains 21 sections, 18 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Energy-Guided Flow Matching. Flow matching learns a flow to map from a known source distribution $p_0(x_0)$ to the data distribution $p_1(x_1)$. In contrast, energy-guided flow matching seeks to learn a transformation from a known source distribution $p_0(x_0)$ to an energy-weighted distribution $\hat{p}_1(x_1) \propto p_1(x_1) \exp \{- \lambda \,\mathcal{E}(x_1) \}$ instead. In the minimal example shown in the figure, the energy function $\mathcal{E}$ guides the flow toward generating the upper mode of the multimodal data distribution.
  • Figure 2: Visualization of energy-guided flow matching. Given samples of the data distribution and an energy function, energy-guided flow matching approximates the posterior. We compare fixed energy scaling schedules $t$, $t^2$, $\frac{t^2}{1-t}$, as well as a learnable schedule $h_{\theta}(t)$.
  • Figure 3: In (a) we plot policy training time in seconds for $10^4$ gradient steps over the number of sampling steps $t$. In (b) we plot the normalized return on halfcheetah environments for different choices of $h(t)$.
  • Figure 4: Normalized return of FlowQ for the D4RL locomotion environments using 5 seeds.
  • Figure 5: Normalized return of FlowQ for the D4RL adroit environments using 5 seeds.
  • ...and 1 more figure
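The energy-weighted target $\hat{p}_1(x_1) \propto p_1(x_1) \exp \{- \lambda \,\mathcal{E}(x_1) \}$ from the Figure 1 caption can be illustrated numerically by reweighting samples. The sketch below uses self-normalized importance weights on a one-dimensional bimodal dataset; this is only a way to visualize the target distribution, not the flow-based method described in the paper. The energy function, the value of `lam`, and the helper name `energy_weights` are hypothetical choices.

```python
import math
import random


def energy_weights(samples, energy_fn, lam):
    """Self-normalized weights proportional to exp(-lam * E(x))."""
    log_w = [-lam * energy_fn(x) for x in samples]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [wi / z for wi in w]


rng = random.Random(0)
# Bimodal "data": a lower mode near -2 and an upper mode near +2, equal mass.
samples = [rng.gauss(-2.0, 0.3) for _ in range(500)]
samples += [rng.gauss(2.0, 0.3) for _ in range(500)]

energy = lambda x: -x  # hypothetical energy: lower for larger x

# Mass assigned to the upper mode without and with energy weighting.
uniform = energy_weights(samples, energy, lam=0.0)
guided = energy_weights(samples, energy, lam=2.0)
upper_mass_uniform = sum(w for x, w in zip(samples, uniform) if x > 0)
upper_mass_guided = sum(w for x, w in zip(samples, guided) if x > 0)
```

With `lam=0.0` the weights are uniform and the two modes keep equal mass, while a positive `lam` moves almost all of the mass to the upper mode, mirroring the qualitative behavior described for the minimal example in Figure 1.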