Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone

TL;DR

The work tackles the misalignment between vision–language–action (VLA) policies and natural language instructions by introducing CoVer, a contrastive verifier that enables test-time scaling for VLAs. CoVer is trained offline on a large dataset with a 1B-parameter backbone and deployed with boot-time compute and a hierarchical language–action optimization pipeline, so that rephrased instructions and candidate actions are selected by their semantic alignment. Empirically, CoVer yields gains of 22% in-distribution and 13% out-of-distribution on SIMPLER, a further 45% improvement in real-world experiments, and improvements on PolaRiS (14% in task progress, 9% in success rate), outperforming policy-scaling baselines while requiring substantially less compute. The findings suggest that deploy-time reasoning and verification can be more effective than further policy pre-training, pointing to a shift in how robotics systems balance training and runtime resources for robust instruction following.

Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
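
The deployment procedure described in the abstract (rephrase, sample, verify, select) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `select_instruction_and_action`, the callables `sample_actions` and `score`, and the toy stand-ins below are all hypothetical names introduced for illustration, and the observation is assumed to be captured inside the callables.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Action = Sequence[float]  # stand-in for a low-level action chunk


def select_instruction_and_action(
    rephrases: List[str],
    sample_actions: Callable[[str, int], List[Action]],  # hypothetical VLA sampler
    score: Callable[[str, Action], float],               # hypothetical verifier score
    actions_per_rephrase: int,
) -> Tuple[str, Action, float]:
    """Return the highest-scoring (instruction, action, score) triple."""
    best: Optional[Tuple[str, Action, float]] = None
    for instruction in rephrases:
        # Sample several candidate actions conditioned on this rephrase.
        for action in sample_actions(instruction, actions_per_rephrase):
            s = score(instruction, action)
            if best is None or s > best[2]:
                best = (instruction, action, s)
    if best is None:
        raise ValueError("no candidates were generated")
    return best


# Toy stand-ins (assumptions, not real models): the "VLA" emits 1-D actions
# derived from the instruction length; the "verifier" prefers actions near 7.0.
def toy_vla(instruction: str, n: int) -> List[Action]:
    base = float(len(instruction))
    return [(base + i,) for i in range(n)]


def toy_verifier(instruction: str, action: Action) -> float:
    return -abs(action[0] - 7.0)


best = select_instruction_and_action(
    ["pick cup", "grasp the mug"], toy_vla, toy_verifier, actions_per_rephrase=3
)
```

In this sketch the rephrase list would be computed once at boot time, while the sampling and scoring loop would run at every control step. The cost per step grows with the product of the number of rephrases and the number of actions per rephrase.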

Paper Structure

This paper contains 39 sections, 8 equations, 9 figures, 7 tables, and 2 algorithms.

Figures (9)

  • Figure 1: We present CoVer-VLA, a contrastive verification framework for vision–language–action alignment. CoVer is trained entirely offline on large-scale robotics datasets with contrastive representation learning. It supports zero-shot alignment verification for generalist robot policies out of the box. At test-time, CoVer can be used to perform instruction optimization and action verification, improving downstream performance for VLAs.
  • Figure 2: Hierarchical Test-Time Verification Pipeline. Left: Given the initial observation and language instruction, a VLM performs structured reasoning over the scene and precomputes a set of rephrased instructions during boot time. At each step during deployment, our framework generates a batch of action candidates for each instruction using a VLA. Middle: CoVer then scores all instruction–action pairs and selects the optimal high-level instruction and low-level action chunk for execution. Right: Compared to prior work on scaling policy learning ($\pi_0$; Black et al., 2024), our approach achieves stronger performance while requiring substantially less compute. The reported training compute for $\pi_0$ includes both pre-training and fine-tuning on augmented instruction sets, whereas $\pi_0$ + CoVer accounts for pre-training $\pi_0$ and training the CoVer verifier on the same data.
  • Figure 3: Test-Time Scaling Law for Embodied Instruction Following. Compared to prior methods that obtain diverse actions through repeated sampling (Nakamoto et al., 2024) or Gaussian perturbations (Kwok et al., 2025), we find that instruction rephrasing produces a broader set of action candidates, leading to improved recovery of the correct action. Furthermore, a hybrid test-time scaling strategy that increases both the number of rephrases and the number of sampled actions per rephrase is more effective than either strategy alone. We characterize each sampling approach using a power law, where the logarithm of oracle action error $e$ is a function of the number of action candidates $k$: $\log(e) \approx \log(a) + b \cdot \log(k)$.
  • Figure 4: Overview of CoVer Training Strategy. CoVer learns a joint embedding space aligning visual observations, language instructions, and robot actions through contrastive pre-training. Image and text encoders extract task-relevant visual–linguistic features, which are fused into text-aware visual representations. An action encoder maps action sequences into the same embedding space, enabling cross-modal alignment between instructions and executed behaviors.
  • Figure 5: Overview of Test-Time Verification Pipeline. At deployment, the system performs hierarchical optimization over language and action spaces. Given a user prompt and the initial observation, a VLM first reasons over the scene and generates a set of rephrased prompts at boot time. For each rephrase, a VLA samples action candidates conditioned on the corresponding instruction. The trained CoVer verifier then scores all instruction–action pairs and selects the optimal prompt and action for execution.
  • ...and 4 more figures
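
The power law in the Figure 3 caption, $\log(e) \approx \log(a) + b \cdot \log(k)$, is linear in log-log space, so $a$ and $b$ can be estimated by ordinary least squares on $(\log k, \log e)$. The sketch below shows one way to do that; the function name `fit_power_law` is hypothetical, and the data points are synthetic values generated from $e = 0.5 \cdot k^{-0.3}$ purely for illustration. They are not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ks: Sequence[float], errors: Sequence[float]) -> Tuple[float, float]:
    """Fit e ~= a * k**b by least squares on (log k, log e); returns (a, b)."""
    if len(ks) != len(errors) or len(ks) < 2:
        raise ValueError("need at least two (k, e) pairs of equal length")
    xs = [math.log(k) for k in ks]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all k values are identical; slope is undefined")
    # The slope of the log-log regression line is the exponent b.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    # The intercept is log(a), so exponentiate to recover a.
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data (an assumption for illustration only, not from the paper).
ks = [1, 2, 4, 8, 16]
errors = [0.5 * k ** -0.3 for k in ks]
a, b = fit_power_law(ks, errors)
```

A negative fitted $b$ means the oracle error falls as more candidates are sampled. Fitting each sampling strategy separately and comparing the resulting exponents is one way to compare how quickly each strategy improves with $k$.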