How Do VLAs Effectively Inherit from VLMs?
Chuheng Zhang, Rushuai Yang, Xiaoyu Chen, Kaixin Wang, Li Zhao, Yi Chen, Jiang Bian
TL;DR
The paper asks how vision-language-action (VLA) models can effectively inherit the priors of large vision-language models (VLMs) for embodied control without catastrophic forgetting during transfer. It introduces GrinningFace, a diagnostic emoji-tabletop benchmark that disentangles visual-semantic priors from motor skills, and implements it both in simulation and on a real robot. Through systematic experiments, the study compares pre-training and fine-tuning strategies (parameter-efficient methods such as LoRA, freezing the VLM backbone, co-training, discretized vs. latent action targets, and diverse pre-training data) to identify what preserves VLM priors while still enabling motor execution. Key findings: co-training and latent-action targets help preserve priors and improve recognition, naive initialization and discretized targets underperform, and diverse VLA pre-training further aids transfer. The work offers actionable guidance and a reproducible framework for building generalizable embodied AI systems, and it highlights substantial room for improving how VLM priors are preserved and activated during VLA training.
Abstract
Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task in which the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing: knowledge associated with emojis is ubiquitous in the Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task both in a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research in developing truly generalizable embodied AI systems.
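To make "predicting discretized actions" concrete, the following is a minimal sketch of one common way to turn a continuous action vector into discrete token indices that a language-model head could predict. It is not taken from the paper: uniform binning, the bin count `NUM_BINS`, the normalized range `[LOW, HIGH]`, and the function names `discretize` and `undiscretize` are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical settings; the abstract does not specify the discretization scheme.
NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def discretize(action: List[float], num_bins: int = NUM_BINS,
               low: float = LOW, high: float = HIGH) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, num_bins - 1]."""
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)           # clamp out-of-range values
        frac = (clipped - low) / (high - low)      # rescale to [0, 1]
        idx = min(int(frac * num_bins), num_bins - 1)  # top edge falls in last bin
        tokens.append(idx)
    return tokens


def undiscretize(tokens: List[int], num_bins: int = NUM_BINS,
                 low: float = LOW, high: float = HIGH) -> List[float]:
    """Map bin indices back to the centers of their bins."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]
```

Under these assumptions, the round trip loses at most half a bin width per dimension, which illustrates the precision trade-off of discrete targets relative to continuous or latent ones.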
