VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation
Xuanran Zhai, Qianyou Zhao, Qiaojun Yu, Ce Hao
TL;DR
The paper addresses velocity ambiguity and mode collapse in flow-matching policies for multi-modal robot manipulation. It introduces the Variational Flow-Matching Policy (VFP), which combines three components: a diffusion-based latent prior that indexes mode-specific action flows, a Kantorovich Optimal Transport (K-OT) regularizer for distribution-level alignment, and a Mixture-of-Experts (MoE) decoder for efficient, mode-specific inference. The approach reduces otherwise irreducible ambiguity by attributing mode structure to a latent variable and enforcing alignment with expert distributions at the population level. In simulation, VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines, and it also outperforms them on real-robot tasks while maintaining fast inference. Experiments across 41 simulated tasks and 3 real-robot tasks show that VFP captures task- and trajectory-level multi-modality efficiently and robustly, and ablations confirm the critical roles of the K-OT and MoE components.
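To make the velocity-ambiguity problem concrete, here is a minimal sketch in plain Python; it is a toy illustration, not the authors' implementation. It assumes a 1-D action space, two hypothetical expert modes centered at -1 and +1, a fixed starting point x0 = 0, and a linear interpolation path whose target velocity is x1 - x0. The names sample_expert_action and target_velocity are illustrative only. Because the velocity that minimizes a squared-error regression loss is a conditional mean, a single unconditioned field averages the two modes, whereas conditioning on a mode index z (standing in here for VFP's latent variable) keeps them apart.

```python
import random


def target_velocity(x0, x1):
    # For the linear path x_t = (1 - t) * x0 + t * x1 the velocity is constant: x1 - x0.
    return x1 - x0


def mean(values):
    return sum(values) / len(values)


def sample_expert_action(rng):
    # Hypothetical bimodal expert: mode z = 0 acts near -1, mode z = 1 acts near +1.
    z = rng.randrange(2)
    center = -1.0 if z == 0 else 1.0
    return z, center + rng.gauss(0.0, 0.05)


rng = random.Random(0)
x0 = 0.0  # assumption: every flow starts from the same point, to isolate the ambiguity
samples = [sample_expert_action(rng) for _ in range(10000)]

# Unconditioned regression target: the average velocity over both modes.
unconditional = mean([target_velocity(x0, a) for _, a in samples])

# Mode-conditioned regression targets: one average velocity per mode index z.
per_mode = {
    z: mean([target_velocity(x0, a) for zi, a in samples if zi == z])
    for z in (0, 1)
}

print("unconditional velocity:", round(unconditional, 3))
print("per-mode velocities:", {z: round(v, 3) for z, v in per_mode.items()})
```

In this toy setting the unconditional average lands near 0, which matches neither expert behavior, while the per-mode averages stay near -1 and +1.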
Abstract
Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/
