Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao

TL;DR

Evo-1 tackles the efficiency bottlenecks of Vision-Language-Action models by introducing a lightweight, 0.77B-parameter VLA that forgoes robot-data pretraining. It builds on a native multimodal Vision-Language Model with a cross-modulated diffusion transformer and a lean integration module, trained via a two-stage procedure to preserve semantic alignment. Empirical results across Meta-World, LIBERO, RoboTwin, and real-world xArm6 and LeRobot SO-100 show state-of-the-art or competitive performance with high inference frequency and low memory usage. The work demonstrates that strong visuomotor control can be achieved with compact architectures and careful training strategies, enabling practical deployment.

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically have very large parameter counts and rely heavily on large-scale robot-data pretraining, leading to high computational costs during training and limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language Model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
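
The abstract does not spell out what each of the two training stages updates. As a rough illustration only, the sketch below assumes one common way to preserve a backbone's representations: keep the VLM frozen while the action-side modules are trained first, then unfreeze everything for joint fine-tuning. The module names, stage names, and the schedule itself are hypothetical placeholders, not taken from the Evo-1 code release.

```python
from dataclasses import dataclass

# Hypothetical module names; the abstract only names these three components.
MODULES = ("vlm_backbone", "integration_module", "diffusion_transformer")


@dataclass(frozen=True)
class Stage:
    """One training stage and the set of modules it is allowed to update."""
    name: str
    trainable: frozenset


def two_stage_schedule():
    """Assumed schedule (not confirmed by the abstract).

    Stage 1 trains only the action-side modules with the VLM frozen,
    stage 2 updates all modules jointly.
    """
    stage1 = Stage("align_action", frozenset(MODULES) - {"vlm_backbone"})
    stage2 = Stage("joint_finetune", frozenset(MODULES))
    return [stage1, stage2]


def requires_grad(stage, module):
    """Return True if `module` receives gradient updates during `stage`."""
    return module in stage.trainable


if __name__ == "__main__":
    for stage in two_stage_schedule():
        flags = {m: requires_grad(stage, m) for m in MODULES}
        print(stage.name, flags)
```

In a real training loop, the per-stage flags would typically be applied to the parameters of each module before building the optimizer, so that frozen modules receive no updates during the first stage.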

Paper Structure

This paper contains 33 sections, 5 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Architecture of Evo-1. The input RGB observations and language instructions are first encoded by a compact vision-language backbone. Their fused representations are aligned with the robot state through an optimized integration module and then processed by a cross-modulated diffusion transformer to generate actions. The right side shows results across three simulation benchmarks.
  • Figure 2: Comparison of vision-language attention maps after training. (a) Evo-1 (InternVL3-1B) yields spatially consistent and semantically aligned activations. (b) OpenVLA (Prismatic-7B) shows degraded coherence in attention maps.
  • Figure 3: Task progress of Real-World Experiments. Step-by-step sequences for the real-world tasks. Each row shows the detailed progression of a task from start to completion.
  • Figure 4: Results of Real-World experiments. Success rates of four real-world evaluation tasks (left four subplots) and the overall average success rate across tasks (rightmost subplot).
  • Figure 5: Disturbance settings of generalization experiments. We evaluate model generalization under four variations: (1) unseen distractor object, (2) background color variation, (3) target position variation, and (4) target height variation.
  • ...and 10 more figures