SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment

Rong Xue, Jiageng Mao, Mingtong Zhang, Yue Wang

TL;DR

This work addresses real-time visuomotor imitation by proposing Selective Flow Alignment (SeFA), a flow-based policy that generates actions in a single step while keeping them aligned with the ground-truth actions for the current observation. SeFA combines a base rectified flow with a selective alignment mechanism that uses expert demonstrations to correct generated actions only when needed, which preserves multimodal action distributions and limits the error that accumulates over repeated reflow iterations. The approach reduces inference latency by over 98% relative to diffusion baselines and achieves higher accuracy and robustness across 66 simulated tasks and 7 real-world manipulation tasks, outperforming diffusion-based and AdaFlow baselines. By unifying efficient rectified flow with observation-consistent action generation, SeFA offers a scalable solution for real-time visuomotor control in complex robotics. Code is available at the project repository. Key mathematical elements include the flow ODE $\mathrm{d}\mathbf{a}_t = v(\mathbf{a}_t, t, \mathbf{O})\,\mathrm{d}t$, the linear interpolation $\mathbf{a}_t = \frac{t}{T}\mathbf{a}_T + \frac{T-t}{T}\mathbf{a}_0$ between the action $\mathbf{a}_0$ and the noise sample $\mathbf{a}_T$, and the selective replacement condition $\|\mathbf{a}_0^{\text{SeFA}} - \mathbf{a}_0^*\| < \epsilon$, where $\mathbf{a}_0^*$ is the expert action.
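
The three formulas above can be made concrete with a small sketch. This is not the authors' implementation: the function names, the toy velocity field, and the plain-list vector type are hypothetical, and the replacement rule (use the expert action when the generated action lies within $\epsilon$ of it, otherwise keep the generated action) is an assumption read off the stated condition. The one-step sampler follows from the interpolation: on a straight path the velocity is the constant $(\mathbf{a}_T - \mathbf{a}_0)/T$, so a single Euler step from $t = T$ to $t = 0$ gives $\mathbf{a}_0 = \mathbf{a}_T - T\,v(\mathbf{a}_T, T, \mathbf{O})$.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature for a learned velocity field v(a_t, t, O).
VelocityFn = Callable[[Sequence[float], float, object], Vector]


def interpolate(a_0: Sequence[float], a_T: Sequence[float],
                t: float, T: float = 1.0) -> Vector:
    """Linear interpolation a_t = (t/T) * a_T + ((T - t)/T) * a_0."""
    w = t / T
    return [w * n + (1.0 - w) * a for a, n in zip(a_0, a_T)]


def one_step_sample(velocity: VelocityFn, a_T: Sequence[float],
                    obs: object, T: float = 1.0) -> Vector:
    """Single Euler step of da_t = v dt from t = T down to t = 0."""
    v = velocity(a_T, T, obs)
    return [n - T * v_i for n, v_i in zip(a_T, v)]


def selective_align(a_gen: Sequence[float], a_expert: Sequence[float],
                    eps: float) -> Vector:
    """Assumed rule: if ||a_gen - a_expert|| < eps, use the expert action;
    otherwise keep the generated action (it may belong to another mode)."""
    dist = math.sqrt(sum((g - e) ** 2 for g, e in zip(a_gen, a_expert)))
    return list(a_expert) if dist < eps else list(a_gen)


def toy_velocity(a_t: Sequence[float], t: float, obs: object) -> Vector:
    """Toy stand-in for a trained network: a perfectly straight flow toward
    one target action. Here `obs` simply carries that target action."""
    return [(x - s) / t for x, s in zip(a_t, obs)]


if __name__ == "__main__":
    target = [0.5, -0.2]   # plays the role of the expert action a_0^*
    noise = [1.3, -0.7]    # plays the role of the noise sample a_T
    a_gen = one_step_sample(toy_velocity, noise, obs=target)
    print(selective_align(a_gen, target, eps=0.05))
```

With a perfectly straight flow the single step lands on the target up to floating-point error, which is why straightening the flow through reflow makes one-step inference possible. The threshold `eps` controls how aggressively generated actions are snapped back to the demonstrations.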

Abstract

Developing efficient and accurate visuomotor policies poses a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: after iterative distillation, generated actions may deviate from the ground-truth actions corresponding to the current visual observation, so error accumulates as the reflow process repeats and task execution becomes unstable. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge through a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations while preserving multimodality. This design introduces a consistency correction mechanism that ensures generated actions remain observation-aligned without sacrificing the efficiency of one-step flow inference. Extensive experiments across both simulated and real-world manipulation tasks show that SeFA Policy surpasses state-of-the-art diffusion-based and flow-based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation-consistent action generation, SeFA provides a scalable and dependable solution for real-time visuomotor policy learning. Code is available at https://github.com/RongXueZoe/SeFA.

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: Overview of SeFA. We train a visuomotor policy iteratively so that it transports samples along straight paths between the noise distribution and the target action space, enabling fast one-step sampling at inference. The action flow is selectively aligned with observations, reducing the error that can accumulate over multiple reflows.
  • Figure 2: Sampling trajectories of SeFA-Policy at different stages. Randomly sampled pairs in (a) have crossing flows. Couplings in (b) have been rewired so they do not intersect with each other at the same denoising timestep. The trajectories in (c) and (d) are nearly straight.
  • Figure 3: Success Rates on Adroit (%).
  • Figure 4: Grasping different objects with one policy. SeFA trained on the apple can generalize to other objects (cube, rubber duck) with similar sizes and locations.
  • Figure 5: Floating object manipulation. SeFA dynamically adjusts its action trajectory to approach and grab a rubber duck moving on the water, demonstrating generalization to different object locations.
  • ...and 2 more figures