SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding
TL;DR
SAC Flow addresses the gradient instability that plagues off-policy learning of flow-based policies by reinterpreting the K-step flow rollout as a sequential model and reparameterizing the velocity network as Flow-G (GRU-style gating) or Flow-T (Transformer-based decoding). A noise-augmented rollout yields tractable likelihoods for SAC, allowing direct end-to-end training in both from-scratch and offline-to-online regimes. Flow-G and Flow-T achieve state-of-the-art sample efficiency on MuJoCo locomotion and robotic manipulation benchmarks without policy distillation or surrogate objectives, and perform strongly in the offline-to-online setting; ablations confirm gradient stability. The work suggests that stable, expressive flow-based policies can be trained efficiently in realistic settings, and it points to real-robot evaluation and lightweight sequential alternatives as future directions for robustness and sim-to-real transfer.
Abstract
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
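To make the recurrence view concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a standard K-step Euler discretization of the flow, a toy one-layer velocity field, a GRU-style gated update of the form gate * (candidate - a_k), and isotropic Gaussian noise with a fixed scale sigma at every step. All function names, weight shapes, and the noise schedule are hypothetical choices made for illustration; the paper's exact Flow-G parameterization and noise-augmented rollout may differ.

```python
import numpy as np


def velocity(a, s, t, W):
    """Toy velocity field v(a, s, t): a single tanh layer (hypothetical stand-in)."""
    x = np.concatenate([a, s, [t]])
    return np.tanh(W @ x)


def flow_rollout(a0, s, W, K=4):
    """Plain Euler rollout: a_{k+1} = a_k + (1/K) * v(a_k, s, t_k).

    The same network is applied K times with a residual connection, so
    backpropagating through it resembles backpropagating through a residual RNN.
    """
    a = a0
    for k in range(K):
        a = a + (1.0 / K) * velocity(a, s, k / K, W)
    return a


def gated_rollout(a0, s, W_gate, W_cand, K=4):
    """Assumed GRU-style variant: the velocity is gate * (candidate - a_k)."""
    a = a0
    for k in range(K):
        x = np.concatenate([a, s, [k / K]])
        gate = 1.0 / (1.0 + np.exp(-(W_gate @ x)))  # sigmoid gate in (0, 1)
        cand = np.tanh(W_cand @ x)                  # candidate in (-1, 1)
        # Each step is a convex combination of a_k and the candidate.
        a = a + (1.0 / K) * gate * (cand - a)
    return a


def noisy_rollout_logprob(a0, s, W, K=4, sigma=0.1, rng=None):
    """Assumed noise-augmented rollout with fixed Gaussian noise per step.

    Every transition is Gaussian, so the log-probability of the sampled path
    given a0 is a sum of Gaussian log-densities.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = a0.shape[0]
    a, logp = a0, 0.0
    for k in range(K):
        mean = a + (1.0 / K) * velocity(a, s, k / K, W)
        eps = rng.standard_normal(d)
        a = mean + sigma * eps
        # log N(a; mean, sigma^2 I), written in terms of eps = (a - mean) / sigma
        logp += -0.5 * np.sum(eps ** 2) - d * np.log(sigma * np.sqrt(2.0 * np.pi))
    return a, logp
```

In the gated form each step interpolates between the current action and a bounded candidate, which keeps the iterates bounded in the same way a GRU state update does. The returned log-probability is that of the whole noisy path conditioned on a0, not the marginal density of the final action; it is shown only to illustrate why per-step noise makes a likelihood computable.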
