Continually Evolving Skill Knowledge in Vision Language Action Model

Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, Hesheng Wang

TL;DR

The paper tackles continual learning for vision-language-action robotics by introducing Stellar VLA, a framework that learns a self-evolving knowledge space anchored in Dirichlet Process priors. It presents two variants: T-Stellar, which builds a task-centric knowledge space, and TS-Stellar, which additionally models hierarchical task–skill relations for reusable skills. A knowledge-guided Mixture-of-Experts routing mechanism leverages the evolving knowledge priors to balance sharing and specialization, enabling robust continual adaptation. Across LIBERO benchmarks and real-world dual-arm tasks, the approach delivers significant improvements in final success rates and stability, demonstrating effective knowledge retention and discovery. The work advances scalable continual learning for large VLA models and suggests pathways for larger, more complex skill spaces in future robotic systems.
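To make the idea of a knowledge space that grows under a Dirichlet Process prior more concrete, the sketch below uses an online DP-means-style rule, a hard-assignment simplification of Dirichlet Process mixture clustering. It is an illustration only, not the method used in Stellar VLA: the function name `dp_means_assign`, the threshold `lam`, and the toy 2-D latents are all assumptions introduced here. A latent joins its nearest existing component unless it is farther than `lam` (in squared distance) from all of them, in which case a new component is opened.

```python
def dp_means_assign(latents, lam):
    """Single-pass DP-means-style clustering (illustrative, hypothetical helper).

    A point opens a new component when its squared Euclidean distance to
    every existing centroid exceeds `lam`; otherwise it joins the nearest
    component and that centroid is updated as a running mean.
    """
    centroids = []  # running mean of each component
    counts = []     # number of points assigned to each component
    labels = []     # component index chosen for each input latent
    for z in latents:
        # Squared distance from z to every existing centroid.
        dists = [sum((a - c) ** 2 for a, c in zip(z, cen)) for cen in centroids]
        if not dists or min(dists) > lam:
            # No component is close enough: grow the knowledge space.
            centroids.append([float(a) for a in z])
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            k = dists.index(min(dists))
            counts[k] += 1
            # Incremental mean update of the chosen centroid.
            centroids[k] = [c + (a - c) / counts[k] for a, c in zip(z, centroids[k])]
            labels.append(k)
    return labels, centroids


# Toy latents from two well-separated groups (made-up values).
toy_latents = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
toy_labels, toy_centroids = dp_means_assign(toy_latents, lam=1.0)
```

The number of components is not fixed in advance; it is determined by the data and the threshold, which is the property that makes nonparametric priors attractive when new tasks keep arriving.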

Abstract

Developing general robot intelligence in open environments requires continual skill learning. Recent Vision-Language-Action (VLA) models leverage massive pretraining data to support diverse manipulation tasks, but they still depend heavily on task-specific fine-tuning, revealing a lack of continual learning capability. Existing continual learning methods are also too resource-intensive to scale to VLA models. We propose Stellar VLA, a knowledge-driven continual learning framework with two variants: T-Stellar, which models a task-centric knowledge space, and TS-Stellar, which captures hierarchical task-skill structure. Stellar VLA enables self-supervised knowledge evolution through joint learning of the task latent representation and the knowledge space, reducing annotation needs. Knowledge-guided expert routing provides task specialization without extra network parameters, lowering training overhead. Experiments on the LIBERO benchmark and real-world tasks show over 50% average improvement in final success rates relative to baselines. TS-Stellar further excels in complex action inference, and in-depth analyses verify effective knowledge retention and discovery. Our code will be released soon.

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 2: Overall architecture of Stellar VLA. CLIP (Radford et al., 2021) and a FiLM-conditioned (Perez et al., 2018) ResNet encode language and visual inputs, respectively. The task-centric representation $z$ and the knowledge space are jointly learned through knowledge update and latent aggregation, as detailed in \ref{sec:k-learn}. The learned knowledge prior finally guides the MoE action head for motion prediction, as detailed in \ref{sec:moe}.
  • Figure 3: Knowledge-prior-routed MoE action head. Two knowledge embeddings, relation and top-K semantic, are computed for expert routing, alongside the language, noise, observation, and noisy action tokens fed into the denoising transformer.
  • Figure 4: t-SNE visualization of Stellar VLA. Latent representations after 1, 4, 6, 8, and 10 tasks on LIBERO-long are shown. Task names are abbreviated for clarity. T-Stellar models discrete task distributions, and TS-Stellar learns relevant skills across tasks.
  • Figure 5: Behavior visualization on "Pick up Bag" after training on "Handover Toy". TS-Stellar achieves the most synchronized dual-arm motion; T-Stellar hesitates slightly; MoDE$^*$ and w/o KS show strong desynchronization and ultimately fail.
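
The Figure 3 caption describes expert routing driven by knowledge embeddings rather than by a separately learned per-task router. The sketch below shows one generic way such routing could look: scores come from a dot product between a knowledge embedding and per-expert keys, the top-k experts are kept, and their outputs are mixed with softmax weights. Everything here is an assumption for illustration (the names `route_with_prior` and `moe_forward`, the dot-product scoring, the toy experts); the paper's actual routing may differ.

```python
import math


def route_with_prior(knowledge_emb, expert_keys, top_k=2):
    """Return {expert_index: weight} for the top-k experts (hypothetical helper).

    Scores are dot products between the knowledge embedding and each
    expert key; weights are a softmax over the selected scores only.
    """
    scores = [sum(a * b for a, b in zip(knowledge_emb, key)) for key in expert_keys]
    # Indices of the k highest scores (ties keep their original order).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, weights):
    """Weighted sum of the selected experts' outputs (lists of floats)."""
    outputs = {i: experts[i](x) for i in weights}  # only selected experts run
    dim = len(next(iter(outputs.values())))
    return [sum(weights[i] * outputs[i][d] for i in weights) for d in range(dim)]


# Toy setup: three scalar "experts" and a 2-D knowledge embedding (made up).
toy_experts = [lambda x: [x[0]], lambda x: [10.0 * x[0]], lambda x: [3.0 * x[0]]]
toy_keys = [[2.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
toy_weights = route_with_prior([1.0, 0.0], toy_keys, top_k=2)
toy_output = moe_forward([2.0], toy_experts, toy_weights)
```

Because only the selected experts are evaluated, inputs with similar knowledge embeddings share experts while dissimilar ones are routed apart, which is the sharing-versus-specialization trade-off the TL;DR refers to.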