
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang

Abstract

Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, whereas auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, at the cost of a single-pass VLM reranking overhead.
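
The verification step described above can be illustrated with a minimal sketch. The abstract only says the metric is "perplexity-style", so this example assumes the standard definition (the exponential of the mean negative log-likelihood over a candidate's action tokens). The function names and the per-token log-probabilities are hypothetical, and the batching of all candidates into a single VLM forward pass is not modeled here.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity: exp of the mean negative log-likelihood.

    Assumption: the paper's "perplexity-style" score may differ in detail.
    """
    if not token_log_probs:
        raise ValueError("candidate has no tokens")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_candidate(candidates_log_probs):
    """Return (index, scores) for the candidate with the lowest perplexity."""
    scores = [perplexity(lp) for lp in candidates_log_probs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical per-token log-probabilities that a verifier might assign
# to three drafted action chunks (illustrative numbers, not real data).
drafts = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.9), math.log(0.8)],
    [math.log(0.1), math.log(0.4)],
]

best_index, scores = select_candidate(drafts)
print(best_index, [round(s, 3) for s in scores])
```

Lower perplexity means the verifier finds the candidate's tokens more likely, so the second draft is selected in this toy case.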

Paper Structure (36 sections, 5 equations, 4 figures, 13 tables)


Figures (4)

  • Figure 1: Overview of the inference framework. (a) Auto-regression-based VLA models generate actions via token-by-token decoding. (b) Diffusion-based VLA models predict continuous actions using an additional action expert conditioned on feature representations extracted by the VLM. (c) Ours: in the draft phase, the action expert takes the VLM hidden states as input and generates multiple candidate action trajectories. In the verification phase, the VLM scores all candidates in a single forward pass and selects the trajectory with the best (lowest) perplexity-style score as the final output.
  • Figure 2: Examples of confidence scoring for action trajectories across all benchmark environments. While the VLM verifier sometimes fails to select the most precise action (b-red&b-blue; d-blue&d-brown), it effectively filters out actions that are evidently erroneous (d-red) or insufficiently direct (a-brown, c-blue).
  • Figure 3: Examples of four discrete action token representation methods. Assume the VLM tokenizer encodes '\n' as 198, '0' as 14, '1' as 15, '2' as 16, and so on.
  • Figure 4: Partial view of the real-robot environment and instruction examples. Our tests use the xArm robotic arm and the DaHuan gripper. In the real-world setting, we collected 2000 trajectories and selected from them for testing in this work.
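
The token mapping assumed in the Figure 3 caption can be made concrete with a small sketch. It covers only the character-level mapping stated there ('\n' as 198, '0' as 14, '1' as 15, '2' as 16) and assumes that "and so on" means the digit ids continue consecutively up to '9' as 23. It does not reproduce any of the four representation methods, which the caption does not specify, and the function names are hypothetical.

```python
# Assumed character-to-id mapping taken from the Figure 3 caption.
# Assumption: digit ids are consecutive, so '9' maps to 23.
VOCAB = {"\n": 198, **{str(d): 14 + d for d in range(10)}}
INVERSE_VOCAB = {token_id: ch for ch, token_id in VOCAB.items()}


def encode_digits(text):
    """Map a string of digits and newlines to token ids, one id per character."""
    try:
        return [VOCAB[ch] for ch in text]
    except KeyError as err:
        raise ValueError(f"character {err.args[0]!r} is not in the toy vocabulary")


def decode_digits(token_ids):
    """Invert encode_digits for ids that belong to the toy vocabulary."""
    return "".join(INVERSE_VOCAB[i] for i in token_ids)


# A hypothetical discretized action value "120" followed by a newline separator.
ids = encode_digits("120\n")
print(ids)
```

Real tokenizers may merge several digits into one token, so a one-character-per-token mapping like this holds only under the caption's stated assumption.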