EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang

TL;DR

EvoVLA is a self-supervised VLA framework for long-horizon robotic manipulation built on three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts.

Abstract

Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
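To make the triplet contrastive idea behind SAR concrete, the following is a minimal sketch of a generic margin-based triplet loss over embedding vectors. It is not the authors' implementation: the function names, the cosine-similarity choice, the margin value, and the toy vectors are all assumptions made for illustration. The anchor stands for an observation embedding, the positive for the matching stage description, and the negative for a hard negative (a visually similar but incorrect stage).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (hypothetical helper, margin is an assumption).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` in cosine similarity.
    """
    pos_sim = cosine_similarity(anchor, positive)
    neg_sim = cosine_similarity(anchor, negative)
    return max(0.0, margin - pos_sim + neg_sim)


# Toy embeddings: an easy negative is orthogonal to the anchor,
# a hard negative is nearly identical to it.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]
hard_negative = [1.0, 0.1]

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

In this toy setting the easy negative contributes no loss, while the hard negative still produces a positive loss. That is the usual motivation for mining hard negatives: they keep supplying a training signal after easy distinctions have been learned.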

Paper Structure

This paper contains 41 sections, 14 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: EvoVLA: Addressing stage hallucination in long-horizon manipulation. Our framework combines Stage-Aligned Reward (SAR), Pose-Based Object Exploration (POE), and Long-Horizon Memory to achieve robust performance on complex manipulation tasks. Example results shown: Block Bridge, Stack, and Cup Stacking with Insertion tasks across simulation and real-world deployment.
  • Figure 2: EvoVLA Data Engine. Aligned with Discoverse-L and the video-driven stage discovery pipeline to close the data–reward–policy loop.
  • Figure 3: EvoVLA overview. Built on OpenVLA-OFT backbone, EvoVLA integrates three modules: Stage-Aligned Reward (SAR) with hard negatives and temporal smoothing, Pose-Based Object Exploration (POE) via world models, and Long-Horizon Memory with context selection and gated fusion. The framework couples with Discoverse-L for training and deploys to real robots.
  • Figure 4: Simulation and real-world rollouts. Columns depict temporal progression, rows pair the same task family across domains. Left: dual-camera Discoverse-L rollouts for Stack, Bridge, Jujube-Cup; right: AIRBOT-Play deployments for Stack, Bridge, Insert. The alignment highlights consistent gripper–object interactions and viewpoints after Sim2Real transfer.
  • Figure 5: Stack task qualitative comparison. Columns 1-5 cover sub-task 1 (left block onto middle), then sub-task 2 (right block on top). Top: OpenVLA-OFT (cross markers) opens before contact, dithers, misaligns, and drops the block. Bottom: EvoVLA (check marks) delays opening until contact, aligns within a few corrections, and leaves a stable stack, matching the hallucination reductions reported in the real-world section.
  • ...and 5 more figures