StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian Mo, Qihua Xiao, XiaoShuai Peng, Qingmin Liao, Yu Wang

Abstract

Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, their high computational cost poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, because the stages of a VLA pipeline (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis of the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions, thereby overlapping the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution: a 2.4$\times$ latency speedup and a 6.5$\times$ reduction in execution halting.
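To make the flow-matching idea concrete, the following is a minimal numerical sketch (not the paper's implementation): a model is trained to regress the constant velocity of a straight-line probability path between a noise sample and a target action, and actions are then sampled by Euler integration of the learned velocity field. The function names and the linear-path choice are illustrative assumptions.

```python
import numpy as np

def flow_matching_target(a0, a1, t):
    """Straight-line conditional flow-matching path (illustrative).

    a0: noise sample, a1: target action, t in [0, 1].
    Returns the point on the path and the constant velocity target
    that a velocity network would be trained to regress.
    """
    a_t = (1.0 - t) * a0 + t * a1  # interpolant along the linear path
    v_target = a1 - a0             # its (constant) time derivative
    return a_t, v_target

def euler_sample(velocity_fn, a0, n_steps=8):
    """Generate an action by Euler-integrating a learned velocity field
    from noise a0 at t=0 to an action at t=1."""
    a = np.array(a0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        a = a + dt * velocity_fn(a, t)
    return a
```

Because each Euler step is cheap relative to a full chunk-wise denoising pass, per-step generation of this kind is what allows generation latency to be overlapped with execution of already-produced actions.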

Paper Structure

This paper contains 18 sections, 12 equations, 8 figures, 2 tables, and 2 algorithms.

Figures (8)

  • Figure 1: StreamingVLA explores streaming execution of VLA models by enabling different stages to run asynchronously. By overlapping the latency of the stages without sacrificing performance, it achieves a 2.4$\times$ end-to-end speedup and a 6.5$\times$ reduction in execution halting time, enabling fast and fluent execution.
  • Figure 2: The overall methodology framework of StreamingVLA: we conduct a systematic timeline analysis and derive the optimization targets for fast and fluent execution. We present two key techniques, action flow matching and adaptive early observation, to overlap the latency of action execution with that of action generation and VLM observation, respectively.
  • Figure 3: Illustration of the state-based modeling of action flow matching: it reformulates action modeling from predicting actions as absolute values to predicting actions as updates to a feature-space state accumulated from prior actions. An extended formulation and architectural adjustments are adopted for larger-scale VLA models and mainstream benchmarks.
  • Figure 4: Illustration of action saliency-aware adaptive early observation: we highlight the importance of diverse action saliency and adopt a lightweight predictor to implement an adaptive early-observation scheme.
  • Figure 5: Real-world setup. The real-world evaluation platform includes a tabletop workspace, a Franka Panda robotic manipulator fixed to the table, and an RGB camera for visual observation. All objects are positioned on the table surface.
  • ...and 3 more figures
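The adaptive early-observation idea in Figure 4 can be sketched in a few lines. The paper uses a learned lightweight saliency predictor; the hand-crafted score below (mean magnitude of frame-to-frame action change) is only a stand-in, and all function names and thresholds are illustrative assumptions. The intuition: smooth, low-saliency action segments tolerate a staler observation, so the next VLM observation can be triggered earlier and overlapped with execution.

```python
import numpy as np

def action_saliency(actions):
    """Toy saliency score for an action segment of shape (T, D):
    mean norm of consecutive action differences. A stand-in for the
    paper's learned lightweight predictor."""
    return float(np.mean(np.linalg.norm(np.diff(actions, axis=0), axis=1)))

def observation_trigger_step(actions, horizon, base_lead, max_lead, threshold):
    """Pick the execution step at which to launch the next observation.

    High-saliency (rapidly changing) segments keep a short lead so the
    observation stays fresh; low-saliency segments allow a longer lead,
    overlapping more observation latency with execution.
    """
    lead = base_lead if action_saliency(actions) > threshold else max_lead
    return max(horizon - lead, 0)
```

For example, a slowly drifting trajectory would trigger the next observation several steps early, while a jerky one would wait until near the end of the current horizon.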