SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment
Rong Xue, Jiageng Mao, Mingtong Zhang, Yue Wang
TL;DR
This work addresses the challenge of real-time visuomotor imitation by proposing Selective Flow Alignment (SeFA), a fast, flow-based policy that enables one-step action generation while keeping generated actions aligned with the ground-truth actions for the current observation. SeFA combines a base rectified flow with a selective alignment mechanism that uses expert demonstrations to correct actions only when needed, preserving multimodal action distributions and reducing cumulative drift. The approach reduces inference latency by over 98% relative to diffusion baselines and improves robustness and accuracy across 66 simulated tasks and 7 real-world manipulation tasks, outperforming diffusion-based and AdaFlow baselines. By unifying efficient rectified flow with observation-consistent action generation, SeFA offers a scalable solution for real-time visuomotor control in complex robotics, with code available at the project repository. Key mathematical elements include the flow ODE $\mathrm{d}\mathbf{a}_t = v(\mathbf{a}_t, t, \mathbf{O})\,\mathrm{d}t$, the linear interpolation $\mathbf{a}_t = \frac{t}{T}\mathbf{a}_T + \frac{T-t}{T}\mathbf{a}_0$, and the selective replacement condition $\|\mathbf{a}_0^{\text{SeFA}} - \mathbf{a}_0^*\| < \epsilon$.
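The three formulas above can be illustrated with a minimal Python sketch. It is not taken from the SeFA codebase: the function names are hypothetical, actions are plain lists of floats, and the replacement rule is an assumption about how the condition is used (when the generated action lies within $\epsilon$ of the expert action, the expert action is substituted; otherwise the generated action is kept).

```python
import math

def interpolate(a_0, a_T, t, T=1.0):
    """Linear interpolation a_t = (t/T) * a_T + ((T - t)/T) * a_0."""
    w = t / T
    return [w * n + (1.0 - w) * a for a, n in zip(a_0, a_T)]

def velocity_target(a_0, a_T, T=1.0):
    """Time derivative of the interpolation, (a_T - a_0) / T, constant in t."""
    return [(n - a) / T for a, n in zip(a_0, a_T)]

def selective_align(a_gen, a_expert, eps):
    """Assumed rule: substitute the expert action only when
    ||a_gen - a_expert|| < eps; otherwise keep the generated action."""
    if math.dist(a_gen, a_expert) < eps:
        return list(a_expert)
    return list(a_gen)

# Endpoints of the interpolation: t = 0 gives a_0, t = T gives a_T.
a_0, a_T = [0.0, 1.0], [1.0, -1.0]
start = interpolate(a_0, a_T, 0.0)
end = interpolate(a_0, a_T, 1.0)
```

Under this convention $\mathbf{a}_0$ plays the role of the action and $\mathbf{a}_T$ the other endpoint of the flow, which matches the superscripted $\mathbf{a}_0^{\text{SeFA}}$ and $\mathbf{a}_0^*$ in the replacement condition.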
Abstract
Developing efficient and accurate visuomotor policies is a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: after iterative distillation, generated actions may deviate from the ground-truth actions corresponding to the current visual observation, so errors accumulate as the reflow process repeats and task execution becomes unstable. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge with a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations while preserving multimodality. This design introduces a consistency correction mechanism that keeps generated actions observation-aligned without sacrificing the efficiency of one-step flow inference. Extensive experiments across both simulated and real-world manipulation tasks show that SeFA Policy surpasses state-of-the-art diffusion-based and flow-based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation-consistent action generation, SeFA provides a scalable and dependable solution for real-time visuomotor policy learning. Code is available at https://github.com/RongXueZoe/SeFA.
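To make the efficiency argument concrete, the following Python sketch compares one-step and multi-step Euler integration of a flow. It is a toy under stated assumptions, not the SeFA policy: `euler_generate` and `straight_velocity` are hypothetical names, the velocity field is hand-written rather than learned, and the observation is simply treated as the target action. When the flow paths are perfectly straight, a single Euler step reaches the same endpoint as many small steps, which is the property that rectified flow distillation aims for.

```python
import math

def euler_generate(velocity, a_T, obs, num_steps, T=1.0):
    """Integrate da_t = v(a_t, t, O) dt with Euler steps from t = T down to t = 0."""
    a = list(a_T)
    dt = T / num_steps
    for k in range(num_steps):
        t = T * (1.0 - k / num_steps)  # current time, never zero inside the loop
        v = velocity(a, t, obs)
        a = [x - dt * vx for x, vx in zip(a, v)]
    return a

def straight_velocity(a_t, t, obs):
    """Hypothetical toy field with straight paths: the observation is used
    directly as the target action a_0, so v = (a_t - a_0) / t."""
    return [(x - target) / t for x, target in zip(a_t, obs)]

a_T = [1.0, -1.0]   # starting sample at t = T
obs = [0.5, 0.25]   # toy "observation", doubling as the target action
one_step = euler_generate(straight_velocity, a_T, obs, num_steps=1)
four_step = euler_generate(straight_velocity, a_T, obs, num_steps=4)
same = all(math.isclose(p, q) for p, q in zip(one_step, four_step))
```

With a learned velocity network the paths are only approximately straight, so the agreement between one-step and multi-step outputs would hold approximately rather than exactly.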
