SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling

Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding

TL;DR

SAC Flow addresses the gradient instability that plagues off-policy learning for flow-based policies by reinterpreting the K-step flow rollout as a sequential model and reparameterizing the velocity network with Flow-G (GRU-style gating) and Flow-T (Transformer-based decoding). A noise-augmented rollout makes the likelihoods required by SAC tractable, allowing direct end-to-end training in both from-scratch and offline-to-online regimes. Flow-G and Flow-T achieve state-of-the-art sample efficiency on MuJoCo locomotion and robotic manipulation benchmarks without policy distillation or surrogate objectives. They also perform strongly in the offline-to-online setting, and ablations confirm gradient stability. The work suggests that stable, expressive flow-based policies can be trained efficiently in realistic settings, and it points to real-robot evaluation and lightweight sequential alternatives as future directions for robustness and sim-to-real transfer.

Abstract

Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
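
The equivalence between a flow rollout and a residual recurrence can be illustrated with a minimal sketch. It assumes a uniform Euler discretization with step size 1/K and uses a hypothetical linear velocity field (`W`, `linear_velocity`, and `euler_rollout` are illustrative names, not taken from the paper). With a linear field, the K-step rollout collapses to a K-fold product of the same step Jacobian, which is the structure that makes plain RNNs prone to vanishing or exploding gradients.

```python
import numpy as np

def euler_rollout(velocity, a0, state, num_steps):
    """Integrate a_{i+1} = a_i + dt * velocity(t_i, a_i, state) with dt = 1 / num_steps."""
    dt = 1.0 / num_steps
    a = np.asarray(a0, dtype=float)
    for i in range(num_steps):
        t = i * dt
        # Residual update: same form as a residual RNN step h_{i+1} = h_i + f(h_i, x).
        a = a + dt * velocity(t, a, state)
    return a

# Hypothetical linear velocity field, chosen only because its rollout has a closed form.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))

def linear_velocity(t, a, state):
    return W @ a

K = 4
a0 = rng.normal(size=3)
a_K = euler_rollout(linear_velocity, a0, state=None, num_steps=K)

# For a linear field, each step multiplies by (I + W / K), so K steps give its K-th power.
step_jacobian = np.eye(3) + W / K
closed_form = np.linalg.matrix_power(step_jacobian, K) @ a0
```

For a nonlinear velocity network the per-step Jacobians differ, but the gradient through the rollout is still a product of K such factors.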

Paper Structure

This paper contains 57 sections, 49 equations, 17 figures, 5 tables, 3 algorithms.

Figures (17)

  • Figure 1: An Overview of SAC Flow. The multi-step sampling process of flow-based policies frequently causes exploding gradients during off-policy RL updates. Our key insight is to treat the flow-based policy as a sequential model, for which we first demonstrate an algebraic equivalence to an RNN. We then reparameterize the flow's velocity network using modern sequential architectures (e.g., GRU, Transformer). Our approach stabilizes off-policy RL training and achieves state-of-the-art performance.
  • Figure 2: An illustration of gradient norms during training. When a flow-based model is viewed as an RNN, the most basic sequential model, we observe that it still suffers from exploding gradients during training. This motivates our work to parameterize the flow-based model with advanced sequential architectures, such as a GRU or a Transformer, which can be updated with stable gradients during backpropagation.
  • Figure 3: Velocity network parameterizations for the flow-based policy, shown in the view of sequential models. (a) RNN Cell: It represents the standard flow-based policy where the velocity $v_{\theta}$ is the direct output of a neural network. This simple formulation is prone to gradient instability. (b) GRU Cell: The velocity is computed using a GRU-style gated mechanism. A gate $g_i$ adaptively controls the update strength from a candidate network $\hat{v}_i$, which stabilizes gradient flow. (c) Decoder: The velocity is modeled using a Transformer decoder, where the action-time token $A_{t_i}$ is refined through $L$ layers of state-conditioned cross-attention to produce a decoded velocity.
  • Figure 4: From-scratch training performance. Our SAC Flow-T and SAC Flow-G achieve comparable or better performance across all tasks except Humanoid (Fig. (a)-(f)), demonstrating significant sample efficiency and convergence stability. However, all methods struggle on the hard-exploration, sparse-reward tasks (Can from Robomimic, and Cube-Double from OGBench), highlighting the necessity of offline-to-online training.
  • Figure 5: Aggregated offline-to-online performance on OGBench and Robomimic benchmarks. Each curve shows the mean success rate averaged across multiple task instances within a domain. Specifically, the OGBench results for Cube-Double, Triple, and Quadruple (a-c) are each aggregated over five distinct single-task environments. The Robomimic result (d) is aggregated across the Lift, Can, and Square tasks.
  • ...and 12 more figures
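
The gated velocity described in the Figure 3(b) caption can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the velocity takes the form $g_i \odot (\hat{v}_i - A_{t_i})$, which is the direct analogue of the standard GRU update $h' = h + z \odot (\tilde{h} - h)$. The weight names (`W_g`, `W_c`, `b_g`, `b_c`), the single-layer gate and candidate networks, and the omission of time conditioning are all simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_velocity(a, state, params):
    """GRU-style velocity gate * (candidate - a); all weights are hypothetical."""
    x = np.concatenate([a, state])
    gate = sigmoid(params["W_g"] @ x + params["b_g"])       # g_i, each entry in (0, 1)
    candidate = np.tanh(params["W_c"] @ x + params["b_c"])  # v_hat_i, each entry in (-1, 1)
    return gate * (candidate - a)

def gated_rollout(a0, state, params, num_steps):
    """Euler rollout a_{i+1} = a_i + dt * gated_velocity(a_i, state)."""
    dt = 1.0 / num_steps
    a = np.asarray(a0, dtype=float)
    for _ in range(num_steps):
        a = a + dt * gated_velocity(a, state, params)
    return a

rng = np.random.default_rng(0)
action_dim, state_dim = 2, 3
params = {
    "W_g": rng.normal(size=(action_dim, action_dim + state_dim)),
    "b_g": np.zeros(action_dim),
    "W_c": rng.normal(size=(action_dim, action_dim + state_dim)),
    "b_c": np.zeros(action_dim),
}
state = rng.normal(size=state_dim)
a0 = np.clip(rng.normal(size=action_dim), -1.0, 1.0)
a_final = gated_rollout(a0, state, params, num_steps=4)

# A strongly negative gate bias closes the gate, so the rollout leaves the action unchanged.
closed_params = dict(params, b_g=np.full(action_dim, -50.0))
a_closed = gated_rollout(a0, state, closed_params, num_steps=4)
```

Under this assumed form, each step is a convex combination of the current action and the bounded candidate, so an action that starts in [-1, 1] stays there. This is one intuition for why gating keeps the rollout, and the gradients through it, well behaved.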