GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

He Zhang, Ying Sun, Hui Xiong

Abstract

Flow-matching policies hold great promise for reinforcement learning (RL) because they can capture complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor with significant untapped potential. The design of this initial distribution and the control of policy stochasticity are two critical areas for advancing distilled flow-matching policies. To address both, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This distribution is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative starting point and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging generative models and practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at https://github.com/ZhHe11/GSFlow-RL.

Paper Structure (60 sections, 20 equations, 13 figures, 9 tables, 1 algorithm)

Figures (13)

  • Figure 1: An illustration of denoising from an uninformed Gaussian prior (a) versus an informed, value-guided prior (b). Deeper blue indicates higher value.
  • Figure 2: The visualization of the multi-crescent task.
  • Figure 3: Overview of our algorithm. During training, we first learn a structured prior for the initial noise, which is then used to distill the teacher policy. For online exploration, actions are sampled from the student's entropy-regularized distribution. During evaluation, the deterministic mean of the policy's output is used. The critic update steps are omitted for clarity and are detailed in Appendix \ref{app:q_update}.
  • Figure 4: Visualization of the learned prior distribution after different training stages.
  • Figure 5: Results on the multi-crescent task. Blue crosses denote samples from the offline dataset, while yellow stars represent the actions produced by the policies. (a) shows the offline dataset, which excludes the two globally optimal modes. (b, c) show action distributions after the offline phase. Our method captures the higher-value modes within the dataset, while the baseline shows a less focused distribution. (d, e) show action distributions after the online fine-tuning phase. Our method quickly discovers and converges to both highest-reward modes, whereas the baseline finds only one. More results on this task can be found in Appendix \ref{fig:app_ABToyEnv}.
  • ...and 8 more figures
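
The Figure 3 caption distinguishes two modes of action selection: sampling from the student's distribution during online exploration, and using the deterministic mean during evaluation. A minimal sketch of that switch is given below. It assumes a diagonal-Gaussian output head parameterized by a mean and a log standard deviation; the source does not specify the distribution family, and the names `select_action` and `gaussian_entropy` are hypothetical, not taken from the authors' code. The entropy function is the standard closed form for a diagonal Gaussian, the kind of quantity an entropy regularizer would act on.

```python
import math
import random


def select_action(mean, log_std, explore, rng=random):
    """Pick an action from an assumed diagonal-Gaussian policy head.

    explore=True  -> sample mean + std * noise (online exploration)
    explore=False -> return the mean unchanged (evaluation)
    """
    if not explore:
        return list(mean)
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mean, log_std)]


def gaussian_entropy(log_std):
    """Differential entropy (in nats) of a diagonal Gaussian."""
    d = len(log_std)
    return sum(log_std) + 0.5 * d * (1.0 + math.log(2.0 * math.pi))


# Illustrative values only; a real actor would predict these from the state.
mean, log_std = [0.2, -0.5], [-1.0, -1.0]

eval_action = select_action(mean, log_std, explore=False)
explore_action = select_action(mean, log_std, explore=True,
                               rng=random.Random(0))
entropy = gaussian_entropy(log_std)
```

In this sketch, raising `log_std` increases the entropy and widens the spread of sampled actions, while the evaluation action is unaffected. That is the sense in which an entropy term can move a policy between exploitation and exploration without changing its deterministic output.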