VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
TL;DR
The paper identifies Spatial Modeling misalignment as the key cause of VLA robustness failures under novel viewpoints, showing that pretrained VLA policies retain substantial latent robustness in their physical reasoning. It introduces two lightweight one-shot adaptations, Feature Token Modulation (FTM) and Feature Linear Adaptation (FLA), which recalibrate visual representations with minimal parameter updates. Empirically, FTM and especially FLA achieve state-of-the-art viewpoint and perturbation generalization on Libero-V with far fewer trainable parameters than LoRA, and the authors provide theoretical and qualitative analysis to explain why small affine or low-rank corrections suffice. The results imply that robust embodied perception can be unlocked from existing VLA models through targeted, efficient visual-space realignment, enabling practical deployment without large-scale retraining or data collection.
Abstract
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters, matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
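To make the two adaptations concrete, here is a minimal NumPy sketch of a global affine transformation on visual tokens and a low-rank update to a frozen linear layer. It is illustrative only and is not the authors' implementation: the function names, the token count, the feature width of 2048, the rank of 16, and the identity and zero initializations are all assumptions. The width is chosen only so that a per-channel scale and shift total roughly 4K parameters; the abstract does not state how the paper parameterizes FTM or where FLA places its updates inside the ViT.

```python
import numpy as np


def feature_token_modulation(tokens, gamma, beta):
    """Apply one shared per-channel scale and shift to every visual token.

    tokens: (num_tokens, dim); gamma, beta: (dim,), broadcast over tokens.
    """
    return tokens * gamma + beta


def low_rank_linear(x, W, A, B, scale=1.0):
    """Frozen linear map W plus a trainable low-rank correction B @ A.

    x: (n, dim_in); W: (dim_out, dim_in); A: (rank, dim_in); B: (dim_out, rank).
    """
    return x @ W.T + scale * (x @ A.T) @ B.T


# Hypothetical sizes, chosen for illustration only.
num_tokens, dim, rank = 256, 2048, 16
rng = np.random.default_rng(0)
tokens = rng.standard_normal((num_tokens, dim))

# Affine modulation: start at the identity so the pretrained features are unchanged.
gamma = np.ones(dim)
beta = np.zeros(dim)
ftm_params = gamma.size + beta.size  # 2 * dim trainable values

# Low-rank update: B starts at zero so the layer initially equals the frozen W.
W = rng.standard_normal((dim, dim))          # frozen pretrained weight (stand-in)
A = rng.standard_normal((rank, dim)) * 0.01  # trainable
B = np.zeros((dim, rank))                    # trainable
low_rank_params = A.size + B.size            # 2 * rank * dim per adapted layer

modulated = feature_token_modulation(tokens, gamma, beta)
adapted = low_rank_linear(tokens, W, A, B)
```

In this sketch only `gamma`, `beta`, `A`, and `B` would be trained on the one-shot demonstration while the rest of the model stays frozen. Both corrections begin as no-ops, so adaptation starts from the pretrained behavior, and the trainable parameter count stays small relative to the `dim * dim` entries of a single full weight matrix.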
