Mean-Flow based One-Step Vision-Language-Action

Yang Chen; Xiaoguang Ma; Bin Zhao

TL;DR

This work resolves the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation.

Abstract

Recent advances in FlowMatching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these achievements, their practical application is constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the proposed Mean-Flow based One-Step VLA generates actions 8.7 times faster than SmolVLA and 83.9 times faster than Diffusion Policy. These results demonstrate its potential as a high-efficiency backbone for VLA-based robotic manipulation.
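To make the contrast between iterative sampling and one-step generation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the general MeanFlow convention in which noise sits at t = 1, the action sits at t = 0, and a mean velocity field u(z, r, t) maps a sample from time t back to time r. `ToyField` is a hypothetical stand-in for the trained action expert: it uses a closed-form field for a single target action chunk and ignores the VLM conditioning entirely. The chunk shape (50 steps, 7 dimensions) is also an illustrative assumption.

```python
import numpy as np


class ToyField:
    """Hypothetical stand-in for a trained action expert (assumption).

    For the linear path z_t = (1 - t) * target + t * noise, the velocity
    along each trajectory is constant, so the instantaneous velocity and
    the mean velocity over [r, t] are both (z - target) / t.
    """

    def __init__(self, target):
        self.target = target
        self.calls = 0  # number of function evaluations (NFE)

    def velocity(self, z, t):
        self.calls += 1
        return (z - self.target) / t

    def mean_velocity(self, z, r, t):
        self.calls += 1
        return (z - self.target) / t


def sample_euler(velocity, noise, n_steps):
    """Iterative flow-matching sampling: Euler steps from t = 1 to t = 0."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    z = noise
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        z = z - (t_cur - t_next) * velocity(z, t_cur)
    return z


def sample_one_step(mean_velocity, noise):
    """One-step sampling: a single jump from t = 1 to r = 0."""
    return noise - (1.0 - 0.0) * mean_velocity(noise, 0.0, 1.0)


rng = np.random.default_rng(0)
target_chunk = rng.normal(size=(50, 7))  # illustrative action chunk (assumption)
noise = rng.normal(size=target_chunk.shape)

euler_field = ToyField(target_chunk)
euler_action = sample_euler(euler_field.velocity, noise, n_steps=10)

one_step_field = ToyField(target_chunk)
one_step_action = sample_one_step(one_step_field.mean_velocity, noise)

print("Euler NFE:", euler_field.calls)
print("One-step NFE:", one_step_field.calls)
```

In this toy setting both samplers recover the same action chunk, so the only difference is the number of network calls. With a learned model, each call is a forward pass through the action expert, which is where the latency saving of a one-step sampler would come from.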

Paper Structure (16 sections, 19 equations, 4 figures, 5 tables)

INTRODUCTION
RELATED WORK
Diffusion-based Visual-Language-Action Models
FlowMatching-based Visual-Language-Action Models
Generation Efficiency Improvement
METHODOLOGY
Introducing MeanFlow into VLA
Action Generation Strategy
EXPERIMENTS
Parameter Sensitivity
Flow Ratio
Loss Metric
Number of Function Evaluations (NFE)
Action Chunk Size
Ablation Study
...and 1 more sections

Figures (4)

Figure 1: Overview of the Mean-Flow based One-Step VLA framework. The pretrained VLM processes multimodal inputs. During training, the Mean-Flow action expert learns to approximate the mean denoising vector field conditioned on VLM features. During inference, it predicts this mean vector field to obtain the action $A_0$.
Figure 2: Illustration of the real-world robotic arm and the three real-world manipulation tasks.
Figure 3: MeanFlow under various flow-ratio settings.
Figure 4: Average action-generation speed of Diffusion Policy, SmolVLA, and One-Step VLA across the three real-world manipulation tasks. One-Step VLA is 8.7 times faster than SmolVLA and 83.9 times faster than Diffusion Policy.
