Latent Policy Steering through One-Step Flow Policies

Hokyun Im; Andrey Kolobov; Jianlong Fu; Youngwoon Lee

TL;DR

This work proposes Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor.

Abstract

Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
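The core mechanism described above, pushing an action-space Q-gradient back through a differentiable one-step decoder to move a latent, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: a fixed linear map `decode` stands in for the one-step MeanFlow policy, `q_value` is a hand-written quadratic critic, gradients are derived by hand rather than by automatic differentiation, a single latent is optimized directly instead of training a latent actor network, and the unit-sphere projection is only one guess at what a spherical latent constraint might look like.

```python
import math

# Hypothetical frozen one-step decoder a = g(z) = W z + b (stand-in for MeanFlow).
W = [[1.0, 0.5], [-0.5, 1.0]]
B = [0.2, -0.1]
A_STAR = [1.0, 0.5]  # hypothetical action where the toy critic peaks


def decode(z):
    """Map a latent z to an action in a single step."""
    return [sum(W[i][j] * z[j] for j in range(2)) + B[i] for i in range(2)]


def q_value(a):
    """Toy action-space critic: Q(a) = -||a - a*||^2."""
    return -sum((a[i] - A_STAR[i]) ** 2 for i in range(2))


def q_grad_action(a):
    """Gradient of the toy critic with respect to the action."""
    return [-2.0 * (a[i] - A_STAR[i]) for i in range(2)]


def q_grad_latent(z):
    """Chain rule: dQ/dz = J_g(z)^T dQ/da, where J_g = W for this linear decoder."""
    g = q_grad_action(decode(z))
    return [sum(W[i][j] * g[i] for i in range(2)) for j in range(2)]


def project_to_sphere(z):
    """Renormalize z to unit length (assumed spherical latent constraint)."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def steer(z, steps=200, lr=0.05):
    """Projected gradient ascent on Q(g(z)) with respect to the latent z."""
    z = project_to_sphere(z)
    for _ in range(steps):
        grad = q_grad_latent(z)
        z = project_to_sphere([z[i] + lr * grad[i] for i in range(2)])
    return z


z0 = project_to_sphere([-1.0, 0.3])
z1 = steer(z0)
q_before = q_value(decode(z0))
q_after = q_value(decode(z1))
```

No critic over latents is ever fit in this sketch: the only value function is `q_value` in the action space, and the latent receives its learning signal solely through the decoder's Jacobian. In a real setting the decoder would be a learned network and the Jacobian-vector product would come from an autodiff framework.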

Paper Structure (38 sections, 11 equations, 17 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Generative Behavior Constraints in Offline RL
Reinforcement Learning in Latent Action Spaces
One-Step Generative Models for Robot Learning
Preliminaries
Reinforcement Learning with Action Chunking
MeanFlow for One-step Generative Modeling
Limitations of Prior Work
Latent Policy Steering (LPS)
Differentiable Base Policy via MeanFlow
Spherical Latent Geometry
Direct Latent Policy Steering
Simulation Experiments
Experimental Setup
...and 23 more sections

Figures (17)

Figure 1: Comparison of policy extraction paradigms. (Top) QC-FQL constrains the policy via an explicit regularizer, creating a trade-off between reward maximization and behavioral regularization. (Middle) DSRL resolves this trade-off via latent steering, but requires learning a latent-space critic $Q(s,z)$ via distillation in the offline RL setting. (Bottom) LPS (Ours) achieves robust, tuning-free optimization by backpropagating action-space critic gradients $\nabla_a Q(s,a)$ through a differentiable one-step generative policy.
Figure 2: Sensitivity to the regularization weight $\alpha$ in FQL. Learned policy densities on a 2D toy task with reward concentrated in the top-right corner reveal a pattern: large $\alpha$ yields overly conservative policies, while small $\alpha$ encourages out-of-support actions.
Figure 3: Comparing action-space Q-values and distilled latent-space Q-values. Left to right: (1) dataset distribution with reward intensity; (2) action-space Q-value $Q_\phi(s, a)$ projected into the latent space; (3) learned latent Q-value $Q_\phi(s, z)$; (4) cosine similarity between the gradients in (2) and (3).
Figure 4: OGBench Manipulation Tasks.
Figure 5: Performance on OGBench. We evaluate the success rates across tasks. Bars report the mean success rate over $3$ seeds, and error bars indicate the $95$% confidence interval estimated using bootstrap resampling with $1$K iterations.
...and 12 more figures
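
The Figure 5 caption reports a mean over $3$ seeds with a $95$% bootstrap confidence interval from $1$K resamples. A minimal sketch of one way to compute such an interval follows. The percentile method is an assumption, since the caption does not say which bootstrap variant is used, and the `success_rates` values are made up for illustration and are not results from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = round((1.0 - level) / 2.0 * n_boot)
    hi_idx = round((1.0 + level) / 2.0 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed success rates for one task (not taken from the paper).
success_rates = [0.62, 0.70, 0.66]
mean_rate = statistics.fmean(success_rates)
ci_low, ci_high = bootstrap_ci(success_rates)
```

With only three seeds, the resampled means can take just a handful of distinct values, so the interval endpoints tend to sit near the smallest and largest per-seed results.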
