DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation

Guowei Zou, Haitao Wang, Hejun Wu, Yukun Qian, Yuhang Wang, Weibing Li

TL;DR

DM1 introduces a MeanFlow-based policy augmented with dispersive regularization to enable true one-step action generation in vision-based robotic manipulation. Dispersive losses are applied to intermediate embeddings ($H^{(T)}$, $H^{(R)}$, $H^{(\text{Cond})}$) to prevent representation collapse without architectural changes, and four variants (InfoNCE-L2, InfoNCE-Cosine, Hinge, Covariance) are evaluated. On RoboMimic benchmarks, DM1 delivers $20$–$40\times$ faster inference and $10$–$20$ percentage-point gains in success rate, with Lift reaching near-perfect performance. Real-robot validation on a Franka-Emika-Panda confirms sim-to-real transfer and real-time control above $50$ Hz. These results indicate that representation regularization can sustain multimodal control signals in flow-based policies, enabling practical, real-time manipulation.
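To make the idea of a dispersive loss concrete, the following is a minimal sketch of what an InfoNCE-L2 style term could look like. It assumes the log-mean-exp form over pairwise squared L2 distances used in the dispersive-loss literature; the function name `dispersive_infonce_l2`, the temperature `tau`, and the choice to include self-pairs are illustrative assumptions, not the paper's confirmed definition.

```python
import numpy as np

def dispersive_infonce_l2(h, tau=0.5):
    """Hypothetical InfoNCE-L2 dispersive loss for a batch of embeddings.

    h   : array of shape (batch, ...) holding one intermediate embedding per sample
    tau : temperature (assumed hyperparameter)
    Returns log(mean_{i,j} exp(-||h_i - h_j||^2 / tau)); lower means more spread out.
    """
    h = np.asarray(h, dtype=np.float64)
    h = h.reshape(h.shape[0], -1)  # flatten each sample to a vector

    # Pairwise squared L2 distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(h ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (h @ h.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from round-off

    # Numerically stable log-mean-exp over all pairs (self-pairs included: an assumption)
    logits = -d2 / tau
    m = logits.max()
    return m + np.log(np.mean(np.exp(logits - m)))

# A collapsed batch (identical embeddings) gives the maximum value, 0;
# a spread-out batch gives a negative value, so minimizing the term pushes embeddings apart.
collapsed = dispersive_infonce_l2(np.ones((8, 4)))
spread = dispersive_infonce_l2(3.0 * np.eye(4))
```

Because the term depends only on a batch of embeddings, it can be attached to any intermediate layer without adding network modules, which matches the "no architectural changes" claim above.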

Abstract

The ability to learn multi-modal action distributions is indispensable for robotic manipulation policies to perform precise and robust control. Flow-based generative models have recently emerged as a promising solution to learning distributions of actions, offering one-step action generation and thus achieving much higher sampling efficiency compared to diffusion-based methods. However, existing flow-based policies suffer from representation collapse, the inability to distinguish similar visual representations, leading to failures in precise manipulation tasks. We propose DM1 (MeanFlow with Dispersive Regularization for One-Step Robotic Manipulation), a novel flow matching framework that integrates dispersive regularization into MeanFlow to prevent collapse while maintaining one-step efficiency. DM1 employs multiple dispersive regularization variants across different intermediate embedding layers, encouraging diverse representations across training batches without introducing additional network modules or specialized training procedures. Experiments on RoboMimic benchmarks show that DM1 achieves 20–40 times faster inference (0.07 s vs. 2–3.5 s) and improves success rates by 10–20 percentage points, with the Lift task reaching 99% success compared with 85% for the baseline. Real-robot deployment on a Franka Panda further validates that DM1 transfers effectively from simulation to the physical world. To the best of our knowledge, this is the first work to leverage representation regularization to enable flow-based policies to achieve strong performance in robotic manipulation, establishing a simple yet powerful approach for efficient and robust manipulation.
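The one-step generation mentioned in the abstract can be illustrated with a short sketch. It assumes the usual MeanFlow convention in which $t=1$ is pure noise, $t=0$ is data, and a network predicts the average velocity $u(z_t, r, t)$ over the interval $[r, t]$, so that $z_r = z_t - (t - r)\,u(z_t, r, t)$. The names `one_step_action` and `toy_avg_velocity`, and the way the observation is passed in, are hypothetical and only stand in for a trained policy.

```python
import numpy as np

def one_step_action(avg_velocity, obs_embedding, action_dim, rng):
    """Sample an action with a single evaluation of the average-velocity model.

    avg_velocity : callable (z, r, t, cond) -> array, a stand-in for the trained network
    obs_embedding: conditioning input (hypothetical; here any array the callable accepts)
    """
    noise = rng.standard_normal(action_dim)            # z_1 ~ N(0, I)
    u = avg_velocity(noise, 0.0, 1.0, obs_embedding)   # average velocity over [0, 1]
    return noise - (1.0 - 0.0) * u                     # z_0 = z_1 - (t - r) * u

def toy_avg_velocity(z, r, t, cond):
    """Toy oracle for a linear path z_t = (1 - t) * x + t * eps with target x = cond.

    On this path the velocity eps - x is constant, so the average velocity
    equals (z_t - x) / t. A real policy would learn this quantity instead.
    """
    return (z - cond) / t

rng = np.random.default_rng(0)
target = np.array([0.2, -0.1, 0.4])
action = one_step_action(toy_avg_velocity, target, action_dim=3, rng=rng)
```

With the toy oracle, the single step recovers the target action exactly, which is the property that removes the need for an iterative multi-step denoising loop.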

Paper Structure

This paper contains 30 sections, 24 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Visualization of the effect of dispersive regularization in DM1. (a) Example rollouts showing how similar observations can lead to incorrect vs. correct grasps; (b–c) Feature distributions without and with dispersive loss, where dispersion prevents representation collapse; (d) Method landscape illustrating the speed–quality trade-off; and (e) Quantitative comparison of success rate versus inference time, showing DM1’s superior efficiency and accuracy.
  • Figure 2: DM1 Framework Architecture. Top Left: 1-Step Action Generation showing MeanFlow's core principle of direct trajectory generation through average velocity fields, contrasting with traditional multi-step denoising approaches. Bottom Left: Vision Transformer Encoder processing input images into patch tokens with positional encoding for global visual feature extraction. Top Right: Dispersive Loss components (R Disp., T Disp., Cond Disp.) encouraging embedding separation across different modalities to prevent representation collapse. Bottom Right: Complete DM1 computational flow integrating vision input, state input, and temporal conditions through embedding modules, with dispersive losses applied to intermediate representations and MeanFlow loss for velocity field prediction.
  • Figure 3: Comprehensive evaluation of success rates across varying denoising steps for different weight configurations ($\alpha_{\text{disp}} = 0.1, 0.5, 0.9$) and four robotic manipulation tasks. Each row represents a specific task (Lift, Can, Square, Transport) while columns show different weight factors. The analysis demonstrates the superior performance of our MeanFlow-based approaches (MF, MF+Disp) which achieve competitive success rates with significantly fewer denoising steps (5 steps) compared to baseline methods requiring 32–128 steps.
  • Figure 4: Success rate vs. inference frequency trade-off across different weighting factors ($\alpha_{\text{disp}} = 0.1, 0.5, 0.9$) and four tasks. Each point represents a method's performance, with position indicating the trade-off between task success (y-axis) and computational efficiency (x-axis, left = faster). Our methods achieve consistent improvements in both dimensions compared to baselines (ShortCut, ReFlow).
  • Figure 5: Real-world deployment on a Franka-Emika-Panda robot for Lift (red cube) and Can (Coca-Cola can) tasks. Each task shows wrist-mounted first-person views (top), successful trials with third-person views (middle, green checkmarks), and failure cases (bottom, red crosses). The dual-view visualization enables direct comparison between successful and failed execution.