ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training

Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, Dieter Fox

TL;DR

ManiFlow tackles the challenge of general robot manipulation by learning a dexterous visuomotor policy from multi-modal observations. It combines flow matching with a continuous-time consistency objective and deploys a DiT-X Transformer with AdaLN-Zero conditioning to efficiently fuse visual, language, and proprioceptive cues. In extensive simulations and real-world experiments, ManiFlow outperforms diffusion-based and prior flow-matching policies, achieving higher success rates, faster few-step inference, and strong generalization to unseen objects and backgrounds. The work also provides thorough ablations on time-sampling strategies and perceptual encoders, and demonstrates scalable performance with larger demonstration datasets.

Abstract

This paper introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets. Our website: maniflow-policy.github.io.
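
To make the "1-2 inference steps" claim concrete, here is a minimal NumPy sketch of few-step sampling along a straight flow path. It is an illustration under stated assumptions, not the authors' implementation: it assumes the convention that `x0` is Gaussian noise at t=0 and `x1` is the action chunk at t=1, and it replaces the learned, observation-conditioned network with a hypothetical `oracle_velocity` that already knows the target. The names `interpolate`, `euler_sample`, `predict_endpoint`, `horizon`, and `action_dim` are all made up for this example.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to action x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with equal Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        x = x + dt * velocity_fn(x, t)
    return x


def predict_endpoint(velocity_fn, x_t, t):
    """Jump from an intermediate point straight to the predicted t=1 endpoint."""
    return x_t + (1.0 - t) * velocity_fn(x_t, t)


rng = np.random.default_rng(0)
horizon, action_dim = 4, 7  # hypothetical action-chunk shape
x1 = rng.normal(size=(horizon, action_dim))  # stand-in for a demonstrated action chunk
x0 = rng.normal(size=(horizon, action_dim))  # Gaussian noise sample


def oracle_velocity(x, t):
    # Velocity of the straight path toward one known target x1.
    # ASSUMPTION: a trained policy would replace this with a network
    # conditioned on visual, language, and proprioceptive inputs.
    return (x1 - x) / (1.0 - t)


one_step = euler_sample(oracle_velocity, x0, num_steps=1)
two_step = euler_sample(oracle_velocity, x0, num_steps=2)
print("1-step max error:", np.abs(one_step - x1).max())
print("2-step max error:", np.abs(two_step - x1).max())

# Every point on the same path should map to the same endpoint.
for t in (0.2, 0.5, 0.8):
    x_t = interpolate(x0, x1, t)
    err = np.abs(predict_endpoint(oracle_velocity, x_t, t) - x1).max()
    print(f"t={t}: endpoint error {err}")
```

With an exact velocity field on a straight path, one or two Euler steps already land on the target, which is the intuition behind few-step inference. A learned field is only approximately straight, so the number of steps needed in practice depends on how well training enforces that property.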

Paper Structure

This paper contains 28 sections, 2 equations, 19 figures, 9 tables, 2 algorithms.

Figures (19)

  • Figure 1: Policy Architecture of ManiFlow. Our system processes 2D or 3D visual observations, robot state, or language as inputs and outputs a sequence of actions. We leverage a DiT-X transformer architecture to efficiently optimize a flow matching model with a continuous-time consistency training objective, ensuring high-quality action generation for challenging dexterous tasks.
  • Figure 2: ManiFlow Consistency Training. Given a flow path that smoothly transforms action to noise, we sample multiple intermediate points via linear interpolation (e.g., $x_{t}$, $x_{t_1}$, and $x_{t_2}$). During training, we learn to map any intermediate point on the flow trajectory back to its origin $x_1$ and ensure the self-consistency of sampled points on the same trajectory.
  • Figure 3: DiT-X Block. Unlike DiT (self-attention only) and MDT (basic cross-attention), DiT-X applies AdaLN-Zero conditioning to low-dimensional robot state inputs, and adjusts cross-attention input and output with learned scaling and shift parameters, ensuring adaptive and fine-grained feature interactions between action tokens and multi-modal input tokens. This design enables efficient handling of both low-dimensional control signals and high-dimensional perceptual inputs.
  • Figure 4: Training action error and success rate of DiT-X with and without cross-attention AdaLN-Zero conditioning on 10 MetaWorld tasks with language conditioning.
  • Figure 5: Comparison of language-conditioned multi-task learning on 48 MetaWorld tasks. ManiFlow outperforms the 3D diffusion and flow matching policies across all difficulty levels, with average relative improvements of 31.4% and 34.9%, respectively.
  • ...and 14 more figures
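
The AdaLN-Zero conditioning named in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration of the general AdaLN-Zero recipe (shift, scale, and gate regressed from a conditioning vector by a zero-initialized projection), not the DiT-X block itself, whose exact wiring is not given here. The class `AdaLNZero`, the function `modulated_residual`, the tensor shapes, and the `tanh` stand-in for the attention sublayer are all assumptions made for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize over the feature axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZero:
    """Map a conditioning vector to (shift, scale, gate); projection starts at zero."""

    def __init__(self, cond_dim, hidden_dim):
        self.weight = np.zeros((cond_dim, 3 * hidden_dim))
        self.bias = np.zeros(3 * hidden_dim)

    def __call__(self, cond):
        params = cond @ self.weight + self.bias  # (batch, 3 * hidden_dim)
        shift, scale, gate = np.split(params, 3, axis=-1)
        return shift, scale, gate


def modulated_residual(x, cond, adaln, sublayer):
    """Scale and shift the sublayer input, then gate its output into the residual."""
    shift, scale, gate = adaln(cond)
    # Broadcast per-sample parameters over the token axis.
    h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
    return x + gate[:, None, :] * sublayer(h)


rng = np.random.default_rng(0)
batch, tokens, hidden_dim, cond_dim = 2, 5, 8, 3  # hypothetical sizes
x = rng.normal(size=(batch, tokens, hidden_dim))  # stand-in for action tokens
cond = rng.normal(size=(batch, cond_dim))  # stand-in for a low-dimensional robot state


def sublayer(h):
    # ASSUMPTION: placeholder for an attention or MLP sublayer.
    return np.tanh(h)


adaln = AdaLNZero(cond_dim, hidden_dim)
out_at_init = modulated_residual(x, cond, adaln, sublayer)

# After the projection moves away from zero, the conditioning affects the output.
adaln.weight = rng.normal(scale=0.1, size=adaln.weight.shape)
out_after_update = modulated_residual(x, cond, adaln, sublayer)
```

Because the projection starts at zero, the gate is zero and the block initially acts as the identity on its input. The conditioning signal only begins to influence the tokens once training updates the projection weights.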