VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation

Xuanran Zhai, Qianyou Zhao, Qiaojun Yu, Ce Hao

TL;DR

The paper tackles velocity ambiguity and mode collapse in flow-matching policies for multi-modal robot manipulation. It introduces the Variational Flow-Matching Policy (VFP), which combines three components: a diffusion-based latent prior that indexes mode-specific action flows, a Kantorovich-OT (K-OT) regularizer for distribution-level alignment, and a Mixture-of-Experts (MoE) decoder for efficient, mode-specific inference. By attributing mode structure to a latent variable and enforcing alignment with expert distributions at the population level, the approach reduces ambiguity that would otherwise be irreducible. VFP improves on baselines in simulation (e.g., a 49% relative gain in task success rate) and on real-robot tasks while maintaining fast inference. Experiments across 41 simulated tasks and 3 real-robot tasks show that VFP captures task- and path-level multi-modality efficiently and robustly, and ablations confirm that the K-OT and MoE components are critical.

Abstract

Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/
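
The averaging behavior described above can be reproduced in a toy setting. The following is a minimal sketch, not the authors' implementation. It assumes 1-D actions with two expert modes at -1 and +1, a linear interpolation path between noise x0 and action x1 (so the regression target is x1 - x0), a linear velocity model fitted only at t = 0, and a single Euler step. The `mode` array is a hypothetical stand-in for the latent variable; in VFP the latent is produced by a learned prior rather than read from labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Toy 1-D "demonstrations": two expert modes at -1 and +1 (assumed for illustration).
mode = rng.integers(0, 2, size=n)        # hypothetical discrete latent, one per sample
x1 = np.where(mode == 0, -1.0, 1.0)      # expert action
x0 = rng.standard_normal(n)              # noise sample at t = 0
target_v = x1 - x0                       # conditional velocity for the linear path


def fit_linear(x, v):
    """Least-squares fit of v ~ a * x + b; returns (a, b)."""
    design = np.stack([x, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return coef[0], coef[1]


# Unconditioned velocity model: one fit over all demonstrations.
a, b = fit_linear(x0, target_v)
one_step_plain = x0 + (a * x0 + b)       # single Euler step with dt = 1

# Latent-conditioned velocity model: one fit per mode.
one_step_latent = np.empty(n)
for k in (0, 1):
    idx = mode == k
    a_k, b_k = fit_linear(x0[idx], target_v[idx])
    one_step_latent[idx] = x0[idx] + (a_k * x0[idx] + b_k)

print("unconditioned mean |action|:", np.abs(one_step_plain).mean())
print("latent-cond.  mean |action|:", np.abs(one_step_latent).mean())
```

Under these assumptions the unconditioned fit drives every sample toward the mean action near 0, while the per-mode fit lands on -1 or +1. A well-trained multi-step flow can in principle represent both modes, so the sketch only illustrates the one-step regression limit, not the paper's experiments.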

Paper Structure

This paper contains 30 sections, 13 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: (a) In the Franka Kitchen environment, the robot completes multiple tasks, each of which can be solved along different paths, so the policy must capture both task and path multi-modality (MM). (b) In the Avoiding task, the vanilla flow-matching (FM) policy learns the mean action over all demonstrations, while our variational flow-matching policy (VFP) successfully follows the MM path distribution.
  • Figure 2: Overview of VFP. (A) Model Pipeline: The model consists of a latent-conditioned MoE flow matching network. A prior network generates latent variables from input states, which guide the MoE decoder to predict actions efficiently. During training, a posterior encoder is used for variational learning. (B) Latent-Instructed Mode Identification: Visualization of how VFP enables mode-specific behavior. Compared to the collapsed predictions of vanilla flow matching, our model captures distinct modes conditioned on latent variables. (C) Latent Shaping and Mode Decoupling via OT: Optimal Transport regularization improves mode separation in the latent space, aligning latent modes with distinct action modes.
  • Figure 3: The Avoiding task and the behaviors of the policies. (a) The Avoiding environment. (b) Behavior of FlowPolicy (red trajectories). (c) Behavior of VFP (ours). Movements produced by different experts are shown in different colors.
  • Figure 4: Tasks in (a) Adroit and (b) Meta-World.
  • Figure 5: Experiments on Real-World Avoiding, Cups Nesting and Tubes Placement.
  • ...and 1 more figure
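
The captions for Figure 1(b) and Figure 2(C) contrast mean-seeking behavior with distribution-level alignment. As a hedged illustration of why an optimal-transport metric is a natural fit, the sketch below compares a collapsed prediction and a bimodal prediction against toy expert actions. The exact form of the paper's K-OT regularizer is not given in this summary, so the function `wasserstein_1d` and the toy data are assumptions. The sketch relies on the standard fact that, for equal-size 1-D samples and a convex cost, the optimal Kantorovich coupling pairs the sorted samples.

```python
import numpy as np


def wasserstein_1d(a, b, p=2):
    """Exact p-Wasserstein distance between two equal-size 1-D empirical samples.

    In 1-D with a convex cost, the optimal coupling matches sorted samples.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean(np.abs(a - b) ** p) ** (1.0 / p))


rng = np.random.default_rng(0)
n = 4_000

# Toy expert actions: two tight modes around -1 and +1 (assumed for illustration).
expert = np.concatenate([rng.normal(-1.0, 0.05, n // 2),
                         rng.normal(1.0, 0.05, n // 2)])

# A collapsed policy that outputs the mean action, and a policy that keeps both modes.
collapsed = rng.normal(0.0, 0.05, n)
bimodal = np.concatenate([rng.normal(-1.0, 0.05, n // 2),
                          rng.normal(1.0, 0.05, n // 2)])

mean_gap = abs(expert.mean() - collapsed.mean())
w_collapsed = wasserstein_1d(expert, collapsed)
w_bimodal = wasserstein_1d(expert, bimodal)

print("mean gap (collapsed vs expert):", mean_gap)
print("W2 (collapsed vs expert):      ", w_collapsed)
print("W2 (bimodal vs expert):        ", w_bimodal)
```

In this toy setup the collapsed policy matches the expert mean almost exactly, yet its transport distance to the expert samples is close to 1, while the bimodal policy's distance is near 0. A sample-wise or mean-based loss would not separate the two cases this clearly.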