
Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

Yuxuan Gao, Yedong Shen, Shiqi Zhang, Wenhao Yu, Yifan Duan, Jia Pan, Jiajia Wu, Jiajun Deng, Yanyong Zhang

Abstract

Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require multi-step iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), making them costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to $100\times$ faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
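To make the cost argument concrete, here is a minimal sketch contrasting the two inference regimes. The `denoiser` and `generator` callables, their signatures, and the step count are illustrative assumptions, not the paper's implementation; the point is only that iterative denoising spends one network function evaluation per step, while a native one-step policy spends a single NFE per action.

```python
import torch

@torch.no_grad()
def sample_multistep(denoiser, obs, num_steps=100, action_dim=7):
    """Iterative denoising: one network call per step, i.e. num_steps NFEs."""
    a = torch.randn(1, action_dim)                # start from Gaussian noise
    for t in reversed(range(num_steps)):
        t_norm = torch.full((1,), t / num_steps)  # normalized timestep
        a = denoiser(obs, a, t_norm)              # 1 NFE per iteration
    return a

@torch.no_grad()
def sample_one_step(generator, obs, action_dim=7):
    """Native one-step generation: a single forward pass, i.e. 1 NFE."""
    z = torch.randn(1, action_dim)                # latent noise input
    return generator(obs, z)                      # 1 NFE total
```

At 100 denoising steps versus a single forward pass, this gap is the source of the up-to-$100\times$ inference speedup reported above.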


Paper Structure

This paper contains 17 sections, 17 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Generative policy paradigms for robot control. (a) Multi-step diffusion policies rely on iterative denoising at inference. (b) One-step mean-flow policies generate actions in one pass with auxiliary corrections. (c) Drift-Based Policy internalizes attraction-repulsion refinement during training, yielding a native one-step generator. (d) Our method achieves the best average success rate and control frequency on Adroit and MetaWorld against established generative baselines.
  • Figure 2: Two-stage Drift-Based Policy framework. Stage 1 learns a native one-step generator by applying attraction-repulsion refinement during offline training. Stage 2 fine-tunes a stochastic actor initialized from the pretrained backbone with on-policy PPO and anchor regularization, while deployment remains one-step (1-NFE); a minimal illustrative sketch of this Stage-2 interface follows the figure list.
  • Figure 3: Evolution of the internalized drift manifold. The policy action distribution (blue) progressively aligns with expert modes (peach) during training.
  • Figure 4: Online PPO fine-tuning results on RoboMimic and D4RL with anchor ablation (DBPO vs. DBPO w/o anchor). Solid bars denote offline initialization, and hatched bars denote gains after fine-tuning. In this evaluation setting, DBPO achieves the strongest post-fine-tuning performance, while removing the anchor consistently shrinks the fine-tuning gains over the pretrained initialization.
  • Figure 5: Real-world bimanual deployment on the physical UR5 testbed. Drift-Based Policy executes the precision Lift and Can tasks and a synchronized bimanual Transport task using raw inputs from three camera views.
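The excerpt does not spell out DBPO's Stage-2 objective beyond the Figure 2 caption (a stochastic actor initialized from the pretrained one-step backbone, trained with on-policy PPO and anchor regularization). The sketch below is one plausible reading under those assumptions; every name in it (`StochasticOneStepActor`, `ppo_loss_with_anchor`, `anchor_coef`) is a hypothetical choice of ours, not the paper's API.

```python
import copy
import torch
import torch.nn as nn

class StochasticOneStepActor(nn.Module):
    """Hypothetical Stage-2 interface: a Gaussian head over a pretrained
    one-step backbone, so on-policy updates have tractable log-probs."""

    def __init__(self, backbone, action_dim, init_log_std=-1.0):
        super().__init__()
        self.backbone = backbone                      # pretrained one-step generator
        self.anchor = copy.deepcopy(backbone).eval()  # frozen copy for regularization
        for p in self.anchor.parameters():
            p.requires_grad_(False)
        self.log_std = nn.Parameter(torch.full((action_dim,), init_log_std))

    def dist(self, obs, z):
        mean = self.backbone(obs, z)                  # one forward pass (1 NFE)
        return torch.distributions.Normal(mean, self.log_std.exp())

def ppo_loss_with_anchor(actor, obs, z, act, old_logp, adv,
                         clip=0.2, anchor_coef=1.0):
    """Clipped PPO surrogate plus a penalty keeping the actor's mean near
    the frozen pretrained backbone (one possible form of the anchor term)."""
    dist = actor.dist(obs, z)
    logp = dist.log_prob(act).sum(-1)
    ratio = (logp - old_logp).exp()                   # importance ratio
    surrogate = torch.min(ratio * adv,
                          ratio.clamp(1 - clip, 1 + clip) * adv)
    with torch.no_grad():
        anchor_mean = actor.anchor(obs, z)            # pretrained behavior
    anchor_pen = (dist.mean - anchor_mean).pow(2).mean()
    return -surrogate.mean() + anchor_coef * anchor_pen
```

The design point is that the Gaussian head supplies the log-probabilities PPO needs for stable on-policy updates, while the frozen anchor copy discourages the fine-tuned actor from drifting away from the pretrained backbone; at deployment the actor can simply take the mean, preserving one-step (1-NFE) control.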