Continually Evolving Skill Knowledge in Vision Language Action Model

Yuxuan Wu; Guangming Wang; Zhiheng Yang; Maoqing Yao; Brian Sheil; Hesheng Wang

TL;DR

The paper tackles continual learning for vision-language-action robotics by introducing Stellar VLA, a framework that learns a self-evolving knowledge space anchored in Dirichlet Process priors. It presents two variants: T-Stellar, which builds a task-centric knowledge space, and TS-Stellar, which additionally models hierarchical task–skill relations for reusable skills. A knowledge-guided Mixture-of-Experts routing mechanism leverages the evolving knowledge priors to balance sharing and specialization, enabling robust continual adaptation. Across LIBERO benchmarks and real-world dual-arm tasks, the approach delivers significant improvements in final success rates and stability, demonstrating effective knowledge retention and discovery. The work advances scalable continual learning for large VLA models and suggests pathways for larger, more complex skill spaces in future robotic systems.

Abstract

Developing general robot intelligence in open environments requires continual skill learning. Recent Vision-Language-Action (VLA) models leverage massive pretraining data to support diverse manipulation tasks, but they still depend heavily on task-specific fine-tuning, revealing a lack of continual learning capability. Existing continual learning methods are also too resource-intensive to scale to VLA models. We propose Stellar VLA, a knowledge-driven continual learning framework with two variants: T-Stellar, which models a task-centric knowledge space, and TS-Stellar, which captures hierarchical task-skill structure. Stellar VLA enables self-supervised knowledge evolution by jointly learning the task latent representation and the knowledge space, reducing annotation needs. Knowledge-guided expert routing provides task specialization without extra network parameters, lowering training overhead. Experiments on the LIBERO benchmark and real-world tasks show an average improvement of over 50 percentage points in final success rates relative to baselines. TS-Stellar further excels in complex action inference, and in-depth analyses verify effective knowledge retention and discovery. Our code will be released soon.
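To make the idea of a knowledge space that grows with new tasks more concrete, the sketch below illustrates the general mechanism behind a Dirichlet Process prior, using its Chinese restaurant process form: a task latent is either assigned to an existing knowledge component or opens a new one. This is a minimal illustration under stated assumptions, not the Stellar VLA implementation. The function names, the isotropic Gaussian likelihoods, and the parameters `alpha`, `sigma`, and `sigma_new` are all hypothetical choices made for this example.

```python
import math


def crp_prior(counts, alpha):
    """Chinese restaurant process prior over existing components plus one new one.

    counts[k] is the number of latents already assigned to component k;
    alpha is the concentration parameter (larger values favor new components).
    """
    total = sum(counts) + alpha
    return [n / total for n in counts] + [alpha / total]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assignment_posterior(latent, means, counts, alpha=1.0, sigma=1.0, sigma_new=3.0):
    """Posterior over existing components and a candidate new component.

    Assumptions (hypothetical, for illustration only):
      - existing component k has likelihood N(means[k], sigma^2 I)
      - a new component has a broad likelihood N(0, sigma_new^2 I)
    The last entry of the returned list is the probability of a new component.
    """
    dim = len(latent)
    prior = crp_prior(counts, alpha)
    log_w = []
    for k, mean in enumerate(means):
        # Gaussian log-likelihood up to a constant shared by all terms.
        log_lik = -sq_dist(latent, mean) / (2 * sigma ** 2) - dim * math.log(sigma)
        log_w.append(math.log(prior[k]) + log_lik)
    origin = [0.0] * dim
    log_lik_new = -sq_dist(latent, origin) / (2 * sigma_new ** 2) - dim * math.log(sigma_new)
    log_w.append(math.log(prior[-1]) + log_lik_new)
    # Normalize in log space for numerical stability.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def most_likely_component(latent, means, counts, **kwargs):
    """Index of the most probable component; len(means) means 'open a new one'."""
    post = assignment_posterior(latent, means, counts, **kwargs)
    return max(range(len(post)), key=lambda i: post[i])
```

In this sketch, a latent close to an existing component mean is absorbed by that component, while a latent far from every mean is more likely to open a new component. That is the sense in which such a prior lets the number of components grow with the data instead of being fixed in advance.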
