Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, Yuke Zhu

TL;DR

Pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. This finding implies that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay.

Abstract

Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, how continual learning behaves in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small amount of replay data. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer while maintaining strong forward learning capabilities. Furthermore, we find that VLAs can retain relevant knowledge from prior tasks despite performance degradation while learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut-austin-rpl.github.io/continual-vla
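The sequential Experience Replay setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `train_fn` is a hypothetical stand-in for one round of finetuning, the replay buffer is assumed to keep a fixed fraction of each finished task's demonstrations (loosely mirroring the percentage buffer sizes in Figure 2), and replayed data is simply concatenated with the current task's data. Each round starts from the parameters produced by the previous round.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Demo = Any  # placeholder for one demonstration (e.g. an observation/action trajectory)


def sample_replay(demos: Sequence[Demo], fraction: float, rng: random.Random) -> List[Demo]:
    """Keep a random `fraction` of a finished task's demos (at least one).

    Assumption: the buffer fraction is applied per task.
    """
    k = max(1, round(len(demos) * fraction))
    return rng.sample(list(demos), k)


def train_sequentially_with_er(
    tasks: Sequence[Sequence[Demo]],
    train_fn: Callable[[Any, List[Demo]], Any],  # hypothetical finetuning step
    init_params: Any,
    replay_fraction: float,
    seed: int = 0,
) -> Tuple[List[Any], List[Demo]]:
    """Train on tasks one after another, mixing in replayed data from earlier tasks.

    Checkpoint i is initialized from checkpoint i-1 (init_params for the first task).
    Returns all checkpoints and the final replay buffer.
    """
    rng = random.Random(seed)
    replay_buffer: List[Demo] = []
    checkpoints: List[Any] = []
    params = init_params
    for demos in tasks:
        # Current-task data plus everything stored from previous tasks.
        params = train_fn(params, list(demos) + replay_buffer)
        checkpoints.append(params)
        # After finishing the task, store a small subset of its data for replay.
        replay_buffer.extend(sample_replay(demos, replay_fraction, rng))
    return checkpoints, replay_buffer
```

Concatenation is the simplest mixing rule; a practical implementation might instead draw each minibatch with a fixed ratio of current-task and replayed samples.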

Paper Structure (26 sections, 3 equations, 14 figures, 11 tables)

Figures (14)

  • Figure 1: Comparison of continual learning performance between a pretrained Vision-Language-Action (VLA) model (GR00T N1.5; nvidia2025gr00tn1openfoundation) and a non-pretrained small policy model (BC-Transformer; liu2023liberobenchmarkingknowledgetransfer). Each checkpoint corresponds to a model obtained by sequentially training on ten tasks under Experience Replay (ER), where checkpoint $i$ is initialized from the parameters of checkpoint $i\!-\!1$. Each matrix entry $(i,j)$ denotes the success rate on Task $j$ after training on Task $i$. The columns track how performance on a given task evolves as training continues (top to bottom). We compare a pretrained VLA model (top) with a non-pretrained small BC policy (bottom) across multiple LIBERO benchmark suites.
  • Figure 2: Negative Backward Transfer (NBT) across different replay buffer sizes. Each subplot shows NBT as a function of replay buffer size ($\{0.2\%, 2\%, 20\%\}$) for all methods across the four benchmarks and their average. Shaded regions indicate $\pm 1$ standard deviation across seeds. Higher NBT indicates more forgetting; values near zero indicate no forgetting. Results and discussion for LIBERO-10 are reported separately in Tab. \ref{tab:cl_metrics_libero10} in Appendix \ref{app:libero10}.
  • Figure 3: Comparison of forgetting across different replay buffer sizes ($10, 100, 1000$) for Pi0 pretrained, Pi0 initialized from PaliGemma, and Pi0 trained from scratch.
  • Figure 4: Pareto frontier of average NBT vs. replay buffer size. We compare forgetting (lower is better) across different buffer sizes for the Pi0 model with different levels of pretraining. We also include BC-Transformer as a non-pretrained, smaller model for reference.
  • Figure 5: Knowledge transfer (sum of success rates) curves across four benchmarks. We compare Pi0 trained from scratch (orange), Pi0 trained from PaliGemma (green), and Pi0 pretrained (blue) under different replay buffer sizes ($10$, $100$, $1000$).
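The success-rate matrix of Figure 1 and the NBT metric of Figure 2 can be connected with a short sketch. It assumes an NBT definition in the style of the LIBERO benchmark, since the exact formula is not reproduced in this summary: with $c_{i,j}$ the success rate on task $j$ after training on task $i$, the forgetting of task $k$ is $\mathrm{NBT}_k = \frac{1}{K-k}\sum_{\tau=k+1}^{K}(c_{k,k} - c_{\tau,k})$, and NBT averages this over every task except the last, which has no later checkpoints. The function name is hypothetical.

```python
from typing import List, Sequence


def negative_backward_transfer(success: Sequence[Sequence[float]]) -> float:
    """Average forgetting from a K x K success-rate matrix.

    success[i][j] = success rate on task j after training on task i (0-indexed).
    For each task k, forgetting is the mean drop from success[k][k] (right after
    learning task k) to success[t][k] at every later checkpoint t > k, i.e. down
    column k of the matrix. Negative values would indicate improvement.
    """
    K = len(success)
    if K < 2:
        return 0.0  # no later checkpoints, so nothing can be forgotten
    per_task: List[float] = []
    for k in range(K - 1):  # the last task has no later checkpoints
        drops = [success[k][k] - success[t][k] for t in range(k + 1, K)]
        per_task.append(sum(drops) / len(drops))
    return sum(per_task) / len(per_task)
```

A matrix whose columns stay constant below the diagonal yields an NBT of zero, matching the "zero forgetting" case described in the abstract.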