Latent Policy Steering through One-Step Flow Policies

Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee

TL;DR

This work proposes Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor.

Abstract

Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
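
The core update described in the abstract is a chain rule: the latent-space gradient is the decoder Jacobian transposed times the action-space critic gradient, $\nabla_z Q(s, g(s,z)) = J_g(s,z)^\top \nabla_a Q(s,a)$. The following is a minimal NumPy sketch of that idea under loud assumptions, not the authors' implementation: the tanh-linear `decode` is a hypothetical stand-in for the one-step MeanFlow policy, the quadratic `q_value` is a toy critic, and the linear latent actor, all constants, and all function names are invented for illustration.

```python
import numpy as np

# Hypothetical frozen one-step decoder a = g(s, z) = tanh(W z + U s).
# This is only a stand-in for a one-step generative policy.
W = np.array([[1.0, 0.5], [-0.3, 0.8]])
U = np.array([[0.1, 0.0, -0.2], [0.0, 0.3, 0.1]])

def decode(s, z):
    return np.tanh(W @ z + U @ s)

def decode_jacobian_z(s, z):
    # da/dz for a = tanh(W z + U s): diag(1 - a^2) @ W
    a = decode(s, z)
    return (1.0 - a ** 2)[:, None] * W

# Toy action-space critic Q(s, a) = -||a - TARGET||^2 (an assumption).
TARGET = np.array([0.5, -0.3])

def q_value(s, a):
    return -np.sum((a - TARGET) ** 2)

def q_grad_a(s, a):
    return -2.0 * (a - TARGET)

def latent_q_grad(s, z):
    # Chain rule: grad_z Q(s, g(s, z)) = J_g(s, z)^T grad_a Q(s, a)
    a = decode(s, z)
    return decode_jacobian_z(s, z).T @ q_grad_a(s, a)

def train_latent_actor(s, steps=500, lr=0.1):
    # Linear latent actor z = actor @ s, updated by gradient ascent on Q.
    actor = np.zeros((W.shape[1], s.shape[0]))
    for _ in range(steps):
        z = actor @ s
        actor += lr * np.outer(latent_q_grad(s, z), s)
    return actor

s = np.array([1.0, -0.5, 0.2])
actor = train_latent_actor(s)
q_before = q_value(s, decode(s, np.zeros(2)))
q_after = q_value(s, decode(s, actor @ s))
print(q_before, q_after)
```

No latent-space critic $Q(s,z)$ appears anywhere in the sketch: the only critic is defined over actions, and the decoder stays fixed while the latent actor moves. In a real setting the Jacobian product would come from automatic differentiation rather than a hand-written formula.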

Paper Structure

This paper contains 38 sections, 11 equations, 17 figures, 7 tables, 2 algorithms.

Figures (17)

  • Figure 1: Comparison of policy extraction paradigms. (Top) QC-FQL constrains the policy via an explicit regularizer, creating a trade-off between reward maximization and behavioral regularization. (Middle) DSRL resolves this trade-off via latent steering, but requires learning a latent-space critic $Q(s,z)$ via distillation in the offline RL setting. (Bottom) LPS (Ours) achieves robust, tuning-free optimization by backpropagating action-space critic gradients $\nabla_a Q(s,a)$ through a differentiable one-step generative policy.
  • Figure 2: Sensitivity to the regularization weight $\alpha$ in FQL. Learned policy densities on a 2D toy task with reward concentrated in the top-right corner reveal a pattern: large $\alpha$ yields overly conservative policies, while small $\alpha$ encourages out-of-support actions.
  • Figure 3: Comparing the action-space Q-value and the distilled latent-space Q-value. Left to right: (1) dataset distribution with reward intensity; (2) action-space Q-value $Q_\phi(s, a)$ projected into the latent space; (3) learned latent Q-value $Q_\phi(s, z)$; (4) cosine similarity between the gradients in (2) and (3).
  • Figure 4: OGBench Manipulation Tasks.
  • Figure 5: Performance on OGBench. We evaluate the success rates across tasks. Bars report the mean success rate over $3$ seeds, and error bars indicate the $95$% confidence interval estimated using bootstrap resampling with $1$K iterations.
  • ...and 12 more figures
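
The Figure 5 caption reports a mean over $3$ seeds with a $95$% confidence interval from bootstrap resampling with $1$K iterations. A minimal sketch of one common way to compute such an interval is given below. It assumes a percentile bootstrap over per-seed success rates; the caption does not say which bootstrap variant was used, and the function name `bootstrap_ci` and the example numbers are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean (assumed variant)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the seeds with replacement, n_boot times.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), lo, hi

# Hypothetical per-seed success rates, not taken from the paper.
success_rates = [0.62, 0.70, 0.66]
mean, lo, hi = bootstrap_ci(success_rates)
print(mean, lo, hi)
```

With only three seeds, the resampled means take few distinct values, so an interval of this kind mostly reflects the spread between the lowest and highest seed.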