VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan

TL;DR

This work tackles the high cost of training large vision-language-action (VLA) models for robotic manipulation by introducing VITA-VLA, a distillation-based framework that preserves the VLM architecture while adding action-execution capability through a lightweight state encoder and a learnable action token. Action knowledge is transferred from a pretrained small action model in two stages: a lightweight alignment stage that maps VLM hidden states into the small model’s action space (reusing its action decoder), followed by fine-tuning of the language model, state encoder, and action modules. The approach achieves state-of-the-art performance on LIBERO (97.3% average success, +11.8% over prior SOTA), LIBERO-LONG (93.5%, +24.5%), and CALVIN-ABC-D, and reaches 82.0% average success across five real-world manipulation tasks, while reducing the computational burden of training from scratch. The framework offers a scalable path to leveraging powerful VLMs for precise, long-horizon robotic manipulation at substantially lower training cost.

Abstract

Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement). These results demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
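
The two-stage strategy can be pictured as a per-stage freeze schedule. The sketch below is a minimal illustration and not the authors' code. The module names are hypothetical, and the stage-1 set follows the Figure 3 caption (action mapper, action tokens, state encoder). Reading "action modules" as token, mapper, and decoder, and keeping the vision encoder frozen in stage 2, are assumptions.

```python
# Hypothetical module names: the paper names the components, not identifiers.
MODULES = [
    "vision_encoder", "language_model", "state_encoder",
    "action_token", "action_mapper", "action_decoder",
]

# Stage 1 (alignment): train only the lightweight bridge between the VLM and
# the pretrained action decoder; the decoder itself is reused unchanged.
# Stage 2 (fine-tuning): update the language model, state encoder, and action
# modules. ASSUMPTION: "action modules" = token + mapper + decoder, and the
# vision encoder stays frozen.
TRAINABLE = {
    "alignment": {"state_encoder", "action_token", "action_mapper"},
    "finetune": {
        "language_model", "state_encoder",
        "action_token", "action_mapper", "action_decoder",
    },
}


def freeze_plan(stage):
    """Return {module_name: should_train} for the given training stage."""
    if stage not in TRAINABLE:
        raise ValueError(f"unknown stage: {stage!r}")
    return {name: name in TRAINABLE[stage] for name in MODULES}


if __name__ == "__main__":
    for stage in ("alignment", "finetune"):
        trained = [name for name, on in freeze_plan(stage).items() if on]
        print(stage, "->", ", ".join(trained))
```

In a deep-learning framework, a plan like this would typically be applied by toggling gradient tracking on each module's parameters before building the optimizer for that stage.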

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of mainstream VLA architectures. (1) Discretization-based methods map vision and language features into action tokens via an LLM but ignore robot state, an essential signal of physical dynamics, making action prediction less effective. (2) Diffusion-based approaches extract vision and language features with a VLM but pass them to a separate action expert for denoising, reducing the VLM to a large feature extractor and limiting its overall capability in action modeling. (3) Our model distills knowledge from a small action model while largely preserving the VLM structure. By integrating robot state through a lightweight encoder and introducing an action token to fuse vision, language, and state, it enables the VLM to actively participate in action modeling rather than only serving as a feature extractor, thereby better leveraging its modeling capabilities.
  • Figure 2: Model Architecture. Our model is built upon VITA-1.5-7B [fu2025vita] and takes images, instructions, action tokens, and state information as inputs to generate executable actions. The visual and textual information is input into the VLM. The action token acts as a learnable query, while the robot state is encoded into a single token using linear layers. An action mapper extracts the hidden states of the action token from the final layer of the VLM and transforms them to match the dimensionality expected by the pretrained action decoder. Finally, the action decoder generates the corresponding actions with 7 degrees of freedom (DoF).
  • Figure 3: Training Strategy. Our training strategy comprises two stages. In the alignment stage, we train the action mapper, action tokens, and state encoder to bridge the gap between the action output spaces of the VLM and the small action model, updating only 30 million parameters while achieving improved fine-tuning outcomes. In the fine-tuning stage, we then perform end-to-end optimization of the entire model to further enhance overall performance.
  • Figure 4: Real-world Tasks. To evaluate the model in real-world settings, we formulate five tasks that span four canonical operations: Pick, Place, Close, and Stack.
  • Figure 5: Real robot setup. The platform consists of a PiPer robotic arm with a Songling gripper, equipped with two complementary cameras: an Intel RealSense D435i base camera (1280$\times$720) and a Dabai DCW gripper-mounted depth camera (640$\times$480).
  • ...and 2 more figures
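
The data flow in the Figure 2 caption (state encoder, action token, action mapper, action decoder) can be sketched as a shape-level toy in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. All dimensions except the 7-DoF output are invented. `vlm_stub` is a hypothetical mean-pooling placeholder for the real VLM. The state encoder is simplified to a single linear layer, and the token ordering is assumed.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights standing in for a learned layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


# Toy sizes, ASSUMED for illustration; only ACTION_DIM = 7 comes from the text.
STATE_DIM, VLM_DIM, DECODER_DIM, ACTION_DIM = 8, 16, 12, 7

rng = random.Random(0)
state_encoder = init_linear(STATE_DIM, VLM_DIM, rng)        # robot state -> one token
action_mapper = init_linear(VLM_DIM, DECODER_DIM, rng)      # VLM hidden -> decoder input
action_decoder = init_linear(DECODER_DIM, ACTION_DIM, rng)  # stand-in for the pretrained decoder
action_token = [rng.uniform(-0.1, 0.1) for _ in range(VLM_DIM)]  # learnable query


def vlm_stub(sequence):
    """Hypothetical placeholder for the VLM: mean-pools the sequence and returns
    the result as the action token's final-layer hidden state. A real VLM would
    compute this with attention over the sequence."""
    n = len(sequence)
    return [sum(tok[i] for tok in sequence) / n for i in range(VLM_DIM)]


def predict_action(vision_text_tokens, robot_state):
    """Images/text tokens + state token + action token -> 7-DoF action."""
    state_token = linear(robot_state, *state_encoder)
    sequence = vision_text_tokens + [state_token, action_token]  # ordering assumed
    hidden = vlm_stub(sequence)
    decoder_input = linear(hidden, *action_mapper)
    return linear(decoder_input, *action_decoder)


tokens = [[rng.uniform(-1, 1) for _ in range(VLM_DIM)] for _ in range(4)]
action = predict_action(tokens, [0.0] * STATE_DIM)
print(len(action))
```

The point of the sketch is the interface: the action mapper only needs to match the input dimensionality of the existing decoder, which is what allows the pretrained decoder to be reused rather than retrained.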