HybridFlow: A Two-Step Generative Policy for Robotic Manipulation

Zhenchen Dong, Jinna Fu, Jiaming Wu, Shengyuan Yu, Fulin Chen, Yide Liu

TL;DR

HybridFlow is envisioned as a practical low-latency method that improves the real-world interaction capability of robotic manipulation policies. It exploits the speed of MeanFlow one-step generation while preserving action precision with a minimal number of generation steps.

Abstract

Limited by inference latency, existing robot manipulation policies cannot interact with the environment in real time. Although faster generation methods such as flow matching are gradually replacing diffusion methods, researchers are pursuing even faster generation suitable for interactive robot control. MeanFlow, a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not meet the stringent requirements of robotic manipulation. We therefore propose \textbf{HybridFlow}, a \textbf{3-stage method} with \textbf{2-NFE}: Global Jump in MeanFlow mode, ReNoise for distribution alignment, and Local Refine in ReFlow mode. The method balances inference speed and generation quality by exploiting the speed of MeanFlow one-step generation while preserving action precision with a minimal number of generation steps. In real-world experiments, HybridFlow outperforms the 16-step Diffusion Policy by \textbf{15--25\%} in success rate while reducing inference time from 152 ms to 19 ms (\textbf{8$\times$ speedup}, \textbf{$\sim$52 Hz}); it also achieves 70.0\% success on unseen-color OOD grasping and 66.3\% on deformable object folding. We envision HybridFlow as a practical low-latency method to enhance the real-world interaction capabilities of robotic manipulation policies.

Paper Structure

This paper contains 5 sections, 14 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Validation Loss of MeanFlow Cannot Reach Usable Levels. Comparison of validation loss across different methods on robot manipulation tasks. While MeanFlow (gray) shows loss reduction from 0-step to 1-step, the achieved loss magnitude ($\sim 10^{-3}$) remains significantly higher than multi-step methods (Diffusion Policy, ReFlow), which converge to $\sim 10^{-4}$ levels. This gap indicates that MeanFlow cannot achieve the precision required for reliable policy performance.
  • Figure 2: HybridFlow Inference Mechanism. Illustration of our 3-stage method with 2-NFE: (1) Global Jump uses MeanFlow mode ($r=0, t=1$) for fast coarse prediction, (2) ReNoise pulls the prediction back into the training distribution via controlled noise injection, (3) Local Refine uses ReFlow mode ($r=t$) for precise correction. The method requires only two network forward passes (NFE=2), because ReNoise is a parameter-free interpolation stage.
  • Figure 3: HybridFlow System Architecture. The system processes fisheye RGB observations through a DINOv3 encoder to generate condition embeddings, which guide the HybridFlow action generator through a 3-stage method with 2-NFE: (1) Global Jump using MeanFlow mode for coarse prediction, (2) ReNoise to pull the prediction back into the training distribution, and (3) Local Refine using ReFlow mode for precise correction. The final action trajectory is executed by the robot controller.
  • Figure 4: ReNoise Ratio Ablation. Validation loss vs. ReNoise ratio $\alpha$ ($t_{\text{refine}}$). The optimal range is $\alpha \in [0.15, 0.20]$, which balances distribution alignment against semantic preservation. A ratio that is too small ($< 0.10$) fails to correct the distribution mismatch; one that is too large ($> 0.25$) introduces excessive noise.
  • Figure 5: Simulation Results. Success rate vs inference steps on four RoboMimic benchmark tasks (Lift, Can, Square, Transport). HybridFlow (red points, 2-NFE) achieves competitive performance compared to multi-step baselines: Diffusion Policy (gray, 16+ steps), ReFlow (blue), and ShortCut (orange). Our method demonstrates that a 3-stage method with 2-NFE can match or exceed the accuracy of methods requiring 8--16 steps, validating our distribution alignment approach in controlled simulation environments.
  • ...and 2 more figures
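
The three stages named in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the common MeanFlow convention in which $t=1$ is pure noise and $t=0$ is data, a linear interpolation for ReNoise, and a single Euler step for Local Refine; none of these details is stated in the captions. The names `hybridflow_sample`, `ToyField`, and the signature `u(z, r, t, cond)` are hypothetical, and the toy field stands in for a trained network.

```python
import random


def hybridflow_sample(u, cond, dim, alpha=0.15, rng=None):
    """Sketch of 3-stage, 2-NFE sampling (assumed conventions, see lead-in).

    u(z, r, t, cond) is a hypothetical network returning the average
    velocity over [r, t]; with r == t it is read as instantaneous velocity.
    """
    rng = rng or random.Random()

    # Stage 1, Global Jump (MeanFlow mode, r=0, t=1): one pass from noise.
    z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    u1 = u(z1, 0.0, 1.0, cond)
    coarse = [z - (1.0 - 0.0) * v for z, v in zip(z1, u1)]

    # Stage 2, ReNoise: parameter-free interpolation toward fresh noise,
    # placing the sample at time t_refine = alpha (no network call).
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z_a = [(1.0 - alpha) * c + alpha * e for c, e in zip(coarse, eps)]

    # Stage 3, Local Refine (ReFlow mode, r=t): one Euler step alpha -> 0.
    v_a = u(z_a, alpha, alpha, cond)
    refined = [z - alpha * v for z, v in zip(z_a, v_a)]
    return coarse, refined


class ToyField:
    """Toy stand-in for a trained network with a point-mass action target.

    It is exact when r == t and deliberately biased when r != t, mimicking
    a coarse one-step jump followed by an accurate local correction.
    """

    def __init__(self, target, jump_bias=0.1):
        self.target = target
        self.jump_bias = jump_bias
        self.nfe = 0  # number of function evaluations

    def __call__(self, z, r, t, cond):
        self.nfe += 1
        bias = self.jump_bias if r != t else 0.0
        return [(zi - xi) / t + bias for zi, xi in zip(z, self.target)]


target = [-1.0, -0.5, 0.0, 0.5, 1.0]
field = ToyField(target)
coarse, refined = hybridflow_sample(
    field, cond=None, dim=len(target), alpha=0.15, rng=random.Random(0)
)
coarse_err = max(abs(c - x) for c, x in zip(coarse, target))
refined_err = max(abs(r - x) for r, x in zip(refined, target))
```

In this toy setting the coarse jump inherits the injected bias, while the refine step removes it using only one additional forward pass. The default `alpha=0.15` is taken from the optimal range reported in the Figure 4 caption.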