VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang

TL;DR

The paper identifies Spatial Modeling misalignment as the key cause of VLA robustness failures under novel viewpoints, showing that pretrained VLA policies retain substantial latent robustness in their physical reasoning. It introduces two lightweight one-shot adaptations, Feature Token Modulation (FTM) and Feature Linear Adaptation (FLA), which recalibrate visual representations with minimal parameter updates. Empirically, FTM and especially FLA achieve state-of-the-art viewpoint and perturbation generalization on LIBERO-V with far fewer trainable parameters than LoRA, and the authors provide theoretical and qualitative analysis to explain why small affine or low-rank corrections suffice. The results imply that robust embodied perception can be unlocked from existing VLA models through targeted, efficient visual-space realignment, enabling practical deployment without large-scale retraining or data collection.

Abstract

Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves LIBERO viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
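
To make the "global affine transformation to visual tokens" concrete, here is a minimal sketch of one way such a modulation could look. It assumes a per-channel scale (gamma) and shift (beta) shared across all tokens; the function names, the token count of 256, and the feature width of 2048 are illustrative assumptions, not details taken from the paper. With that assumed width the two vectors hold 4096 values, which is in line with the roughly 4K parameters the abstract reports.

```python
import numpy as np

def init_ftm(dim):
    # Hypothetical helper: identity initialisation, so the adapted
    # model starts out identical to the pretrained one.
    return {"gamma": np.ones(dim), "beta": np.zeros(dim)}

def apply_ftm(tokens, params):
    # tokens: (num_tokens, dim). The same scale and shift are
    # broadcast over every token, which is what makes it "global".
    return tokens * params["gamma"] + params["beta"]

def count_params(params):
    # Total number of trainable scalars in the modulation.
    return sum(p.size for p in params.values())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 2048))   # assumed shape, for illustration
params = init_ftm(2048)

out = apply_ftm(tokens, params)
print(out.shape)             # same shape as the input tokens
print(count_params(params))  # 2 * 2048 trainable values
```

In a one-shot setting, only gamma and beta would be updated on the single demonstration while the encoder and policy stay frozen; the training loop itself is omitted here because the source does not describe it.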

Paper Structure

This paper contains 45 sections, 3 theorems, 28 equations, 9 figures, 7 tables.

Key Result

Theorem 1

Let be a representative source-domain token. Under Assumption A1,

Figures (9)

  • Figure 1: Visualization of a sample rollout. Each row shows how observations from different viewpoints evolve over time (columns). The rollout highlights the adaptability of our $\pi_{0.5}$ method with One-Shot Feature Linear Adaptation.
  • Figure 2: Comparison of methods for adapting to new visual perturbations in VLA models. Panels (a) and (b) show existing finetuning-based approaches for adapting VLA models to new visual features. Panel (c) illustrates a potential meta-learning strategy using concatenated learnable prompts. Panels (d) and (e) present our proposed methods: Feature Token Modulation and Feature Linear Adaptation.
  • Figure 3: Overview of feature adaptation and benchmark design. (a) Illustration of the proposed Feature Token Modulation mechanism. The dashed box denotes components used only during training and removed at inference. (b) The LIBERO-V (Visual) benchmark, created by combining multiple levels of viewpoint variation (Wilcox et al., 2025) with visual-perturbation tasks from LIBERO-Plus (Fei et al., 2025). It covers four perturbation types: camera viewpoint, lighting, background texture, and image noise.
  • Figure 4: Success rates before and after adaptation on the LIBERO benchmark under novel camera viewpoints. We report Success Rate (SR) across all unseen viewpoints (Wilcox et al., 2025) in the LIBERO suites (Liu et al., 2023). “Before” corresponds to the zero-shot performance of pretrained policies without any adaptation.
  • Figure 5: Qualitative Results on Real-World Tasks. Sample rollouts of the five evaluation tasks executed by our policy after one-shot FLA adaptation. The tasks include: (1) Pick up the red block, (2) Stack the red block on the green block, (3) Close the microwave oven door, (4) Press the green button, and (5) Pull out the top drawer. Despite the substantial visual discrepancy introduced by the novel viewpoint, the adapted policy successfully recovers spatial grounding and executes precise manipulation in a closed-loop manner.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Theorem 1: Policy degradation is upper bounded by representation drift
  • proof
  • Theorem 2: Existence of affine corrections (justification of FTM)
  • proof
  • Theorem 3: Low-rank corrections approximate optimal linear shift (justification of FLA)
  • proof
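
The theorem statements are not reproduced in this listing, so the following is only a generic sketch of the idea behind low-rank corrections, not the paper's construction. It shows (1) a LoRA-style update W + BA on a single linear layer, with B initialised to zero so the layer starts unchanged, and (2) the standard Eckart-Young fact that truncating the SVD gives the best rank-r approximation of a matrix in Frobenius norm. All names, shapes, and the rank are assumptions chosen for illustration.

```python
import numpy as np

def init_low_rank(d_out, d_in, rank, rng):
    # B starts at zero, so W + B @ A equals W before any training.
    A = rng.normal(scale=0.01, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return B, A

def adapted_linear(x, W, B, A):
    # x: (batch, d_in); the frozen weight W receives a rank-limited update.
    return x @ (W + B @ A).T

def truncate_to_rank(M, rank):
    # Best rank-`rank` approximation of M in Frobenius norm (Eckart-Young).
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4            # illustrative sizes
W = rng.normal(size=(d_out, d_in))
B, A = init_low_rank(d_out, d_in, rank, rng)
x = rng.normal(size=(8, d_in))

# At initialisation the adapted layer matches the frozen layer.
print(np.allclose(adapted_linear(x, W, B, A), x @ W.T))

# A hypothetical "ideal" full linear shift and its rank-4 truncation.
ideal_shift = rng.normal(size=(d_out, d_in))
approx = truncate_to_rank(ideal_shift, rank)
singular_values = np.linalg.svd(ideal_shift, compute_uv=False)
residual = np.linalg.norm(ideal_shift - approx)
print(np.isclose(residual, np.sqrt(np.sum(singular_values[rank:] ** 2))))

# Trainable parameters: low-rank factors versus a full matrix.
print(rank * (d_in + d_out), d_in * d_out)
```

The residual equals the energy in the discarded singular values, so a low-rank correction is accurate exactly when the ideal shift has a fast-decaying spectrum. Whether that holds for viewpoint shifts is the paper's claim and is not demonstrated by this random example.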