How Do VLAs Effectively Inherit from VLMs?

Chuheng Zhang, Rushuai Yang, Xiaoyu Chen, Kaixin Wang, Li Zhao, Yi Chen, Jiang Bian

TL;DR

The paper asks how vision-language-action (VLA) models can effectively inherit the priors of large vision-language models (VLMs) for embodied control without catastrophically forgetting them during transfer. It introduces GrinningFace, a diagnostic emoji-tabletop benchmark that disentangles visual-semantic priors from motor skills, implemented both in simulation and on a real robot. Through systematic experiments, the study compares pre-training and fine-tuning strategies (parameter-efficient methods such as LoRA, freezing the VLM backbone, co-training, discretized vs. latent action targets, and diverse pre-training data) to identify what preserves VLM priors while still enabling motor execution. Key findings: co-training and latent-action targets help preserve priors and improve recognition, while naive initialization or discretized targets underperform; pre-training the VLA on diverse data further aids transfer. The work offers actionable guidance and a reproducible framework for developing generalizable embodied AI systems, and highlights substantial room for improving how VLM priors are preserved and activated during VLA training.

Abstract

Vision-language-action (VLA) models hold promise for generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task in which the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing: knowledge associated with emojis is ubiquitous in the Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied control. We implement this diagnostic task both in a simulated environment and on a real robot, and compare various promising techniques for knowledge transfer. Specifically, we investigate the effects of parameter-efficient fine-tuning, VLM freezing, co-training, predicting discretized actions, and predicting latent actions. Through systematic evaluation, our work not only demonstrates the critical importance of preserving VLM priors for the generalization of VLAs but also establishes guidelines for future research on developing truly generalizable embodied AI systems.

Paper Structure

This paper contains 9 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The GrinningFace task is designed for controlled experiments on how VLAs are pre-trained and fine-tuned to efficiently inherit the priors from VLMs. The robotic arm is asked to pick up the cube and place it on the instructed emoji. The emojis are sampled from the training set for fine-tuning and from the validation set for evaluation. We align the viewpoint of this task with the bridge-v2 dataset (Walke et al., 2023) to investigate the role of VLA pre-training on the models. We also create a similar real-robot setup to validate whether our findings in simulation hold on real robots.
  • Figure 2: The performance w.r.t. the number of fine-tuning gradient steps for the baseline VLA with full-parameter fine-tuning (left), the VLM backbone with full-parameter fine-tuning (middle), and the VLA pre-trained using LoRA with LoRA fine-tuning (right). The results indicate that while VLA pre-training enables fast adaptation during fine-tuning, it degrades the VLM priors. Using LoRA in both pre-training and fine-tuning preserves the VLM priors well, but it needs more fine-tuning steps to acquire even simple motor skills.
  • Figure 3: The performance of different fine-tuning methods (full-parameter fine-tuning, LoRA, and fine-tuning only the action expert) on different VLAs: the baseline VLA (Baseline), the VLM backbone used as the VLA (VLM), the VLA co-trained with vision-language tasks (Co-train), the VLA trained with discretized targets (Discrete), and the VLA trained with latent action targets (Latent Action). We label the execution success rate (Exe. SR) on Val, which indicates how well the VLA preserves the VLM priors. We also label the fine-tuning steps at which these results are gathered; these are the checkpoints that achieve the best overall success rate ($=$ Exe. SR $\times$ Rec. SR) on Val.
  • Figure 4: The performance w.r.t. different VLA training datasets. We evaluate VLAs pre-trained on the open-x-embodiment magic soup mixture (OXE), the bridge-v2 dataset (Bridge), and OXE excluding the Bridge dataset (OXE-Bridge), each trained with either continuous or discretized targets. We highlight fine-tuning with only the action expert, which serves as an indicative probe for comparing different pre-training datasets. The results indicate that training VLAs on diverse datasets results in better performance.
  • Figure 5: The attention maps from different image patches to [desc.] (the description of the target emoji) in the task instruction for the VLM backbone, the pre-trained VLA, and the fine-tuned VLA, both in the simulator and on the real robot.
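
The Figure 3 caption defines the overall success rate as the product Exe. SR × Rec. SR and reports each result at the checkpoint that maximizes it on Val. The following is a minimal sketch of that selection rule only, not the authors' evaluation code. The names `Checkpoint` and `best_checkpoint` are hypothetical, Rec. SR is assumed to denote the recognition success rate, and the numbers are placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    """Validation metrics at one fine-tuning step (hypothetical container)."""
    step: int      # number of fine-tuning gradient steps
    exe_sr: float  # execution success rate on Val, in [0, 1]
    rec_sr: float  # assumed: recognition success rate on Val, in [0, 1]

    @property
    def overall_sr(self) -> float:
        # Overall success rate as defined in the Figure 3 caption.
        return self.exe_sr * self.rec_sr


def best_checkpoint(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the highest overall success rate."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    return max(checkpoints, key=lambda c: c.overall_sr)


# Placeholder values for illustration only (not taken from the paper).
checkpoints = [
    Checkpoint(step=1000, exe_sr=0.50, rec_sr=0.80),
    Checkpoint(step=2000, exe_sr=0.75, rec_sr=0.80),
    Checkpoint(step=3000, exe_sr=1.00, rec_sr=0.50),
]
best = best_checkpoint(checkpoints)
print(best.step, round(best.overall_sr, 3))
```

Because the metric is a product, a checkpoint with perfect execution but weak recognition (the third placeholder entry) can rank below one that is moderately good at both, so the two factors are worth reporting separately as well as combined.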