Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

Zihua Wang, Zhitao Lin, Ruibo Li, Yu Zhang, Xu Yang, Siya Mi, Xiu-Shen Wei

Abstract

Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, this performance comes at a high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of closed-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available at https://github.com/edsad122/SV-VLA.

Paper Structure

This paper contains 20 sections, 10 equations, 3 figures, 3 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison of action chunking, speculative decoding, and our proposed Speculative Verification VLA (SV-VLA). (a) Action Chunking VLA predicts and executes an action chunk in an open-loop manner, so later actions may rely on stale observations. (b) Speculative Decoding VLA employs a draft model to generate candidate actions and a heavy target model to verify them in parallel. However, the verified chunk is still executed based on stale observations. (c) Speculative Verification VLA (SV-VLA) performs chunk-level macro planning together with lightweight, frequent verification under continuously updated observations. During execution, it continuously verifies whether the current planned action remains valid under the latest observation and triggers replanning once a mismatch is detected. It combines the efficiency of chunk-level planning with the adaptability of closed-loop feedback, improving both responsiveness and robustness in dynamic environments.
  • Figure 2: Overview of SV-VLA. At each planning boundary $T_0$, a frozen heavy VLA takes the current observation $I_0$, language instruction $L$, and proprioceptive state $s_0$ as input, and outputs a macro action chunk $A^{macro}$ together with a planning context feature $F_0$. During execution, a lightweight verifier runs at control frequency. Given the latest observation $I_1$ and the planning context feature $F_0$, it predicts a reference action $a_1'$, which is compared with the current planned action $a_1$ from the macro chunk. If the discrepancy is below a threshold, the planned action $a_1$ is accepted and executed; otherwise, it is rejected, the remaining chunk is discarded, and the heavy VLA replans from the current state.
  • Figure 3: Qualitative comparison on the task: "pick up the black bowl between the plate and the ramekin and place it on the plate." SV-VLA detects during execution that the bowl has not been successfully grasped, interrupts the current macro action chunk, and replans from the latest observation, which enables successful task completion. In contrast, the open-loop baseline with $K=64$ executes the entire action chunk without correction and fails to recover from the grasping error, while the $K=8$ baseline remains robust through frequent replanning at a much higher inference cost.
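The accept-or-replan logic described in Figure 2 can be sketched as a simple control loop. The following is a minimal, self-contained Python sketch under stated assumptions: `ToyPlanner`, `ToyVerifier`, `ToyEnv`, the 1-D state, the distance test, and the threshold `eps` are all illustrative stand-ins, not the paper's actual models, features, or code.

```python
class ToyPlanner:
    """Illustrative stand-in for the heavy VLA macro-planner."""
    def __init__(self, goal, chunk_size=8):
        self.goal, self.chunk_size = goal, chunk_size

    def plan(self, obs, instruction, state):
        step = 0.1 if self.goal > state else -0.1
        chunk = [step] * self.chunk_size   # open-loop action chunk
        context = {"goal": self.goal}      # stands in for the planning-context feature
        return chunk, context

class ToyVerifier:
    """Illustrative stand-in for the lightweight closed-loop verifier."""
    def predict(self, obs, context):
        return 0.1 if context["goal"] > obs else -0.1

class ToyEnv:
    """1-D environment that injects a disturbance after the 5th step."""
    def __init__(self):
        self.state, self.t = 0.0, 0

    def observe(self):
        return self.state, self.state      # (image stand-in, proprioceptive state)

    def step(self, action):
        self.state += action
        self.t += 1
        if self.t == 5:                    # external disturbance mid-execution
            self.state += 1.0

def sv_vla_loop(planner, verifier, env, instruction, goal=1.0, eps=0.05, max_steps=50):
    obs, state = env.observe()
    chunk, ctx = planner.plan(obs, instruction, state)   # initial macro plan
    i, replans = 0, 0
    for _ in range(max_steps):
        obs, state = env.observe()
        if abs(obs - goal) < 0.05:                       # task completed
            break
        if i >= len(chunk):                              # planning boundary: replan as usual
            chunk, ctx = planner.plan(obs, instruction, state)
            i = 0
        ref = verifier.predict(obs, ctx)                 # closed-loop reference action
        if abs(chunk[i] - ref) > eps:                    # verifier rejects planned action:
            chunk, ctx = planner.plan(obs, instruction, state)  # discard chunk, replan
            i, replans = 0, replans + 1
        env.step(chunk[i])                               # execute accepted action
        i += 1
    return obs, replans
```

In this toy run, the disturbance flips the sign of the verifier's reference action relative to the stale chunk, so the mismatch exceeds `eps`, the remaining chunk is discarded, and a single replan recovers the task; when planned and reference actions agree, the heavy planner is only invoked at chunk boundaries.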