Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Huihan Liu; Changyeon Kim; Bo Liu; Minghuan Liu; Yuke Zhu

TL;DR

Pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. This implies that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to acquire new skills over time with simple replay.

Abstract

Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay buffer. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer while maintaining strong forward learning capabilities. Furthermore, we find that VLAs retain relevant knowledge from prior tasks even when their performance on those tasks degrades while learning new ones. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut-austin-rpl.github.io/continual-vla

Paper Structure (26 sections, 3 equations, 14 figures, 11 tables)

Introduction
Preliminaries
Continual Learning in Robotics
Vision-Language-Action Models
VLAs are Surprisingly Resistant to Forgetting
Evaluating VLAs in Continual Learning
The Surprising Effectiveness of Experience Replay
Pretraining Plays an Integral Role in Improving Continual Learning Performance
VLAs Retain Knowledge that is Seemingly Forgotten
Related Work
Continual Learning Beyond Training from Scratch
VLAs and Lifelong Robot Learning
Conclusion and Discussion
More Continual Learning Results
Confusion Matrix Results for Comparison
...and 11 more sections

Figures (14)

Figure 1: Comparison of continual learning performance between a pretrained Vision-Language-Action (VLA) model (GR00T N1.5; nvidia2025gr00tn1openfoundation) and a non-pretrained small policy model (BC-Transformer; liu2023liberobenchmarkingknowledgetransfer). Each checkpoint corresponds to a model obtained by sequentially training over ten tasks under Experience Replay (ER), where the parameters at the start of training for checkpoint $i$ are initialized from checkpoint $i\!-\!1$. Each matrix entry $(i,j)$ denotes the success rate on Task $j$ after training on Task $i$. The columns track how a given task performance evolves as training continues (top to bottom). We compare a pretrained VLA model (top) with a non-pretrained small BC policy (bottom) across multiple LIBERO benchmark suites.
Figure 2: Negative Backward Transfer (NBT) across different replay buffer sizes. Each subplot shows NBT as a function of replay buffer size ($\{0.2\%, 2\%, 20\%\}$) for all methods across the four benchmarks and their average. Shaded regions indicate $\pm 1$ standard deviation across seeds. Higher NBT indicates more forgetting; values near zero indicate no forgetting. Results and discussion for LIBERO-10 are reported separately in Tab. \ref{tab:cl_metrics_libero10} in Appendix \ref{app:libero10}.
Figure 3: Comparison of forgetting performance across different buffer sizes ($10$, $100$, $1000$) for Pi0 pretrained, Pi0 initialized from PaliGemma, and Pi0 trained from scratch.
Figure 4: Pareto frontier of average NBT vs. replay buffer size. We compare the forgetting performance (lower is better) across different buffer sizes for the Pi0 model with different levels of pretraining. We also provide BC-Transformer as a non-pretrained, smaller model reference.
Figure 5: Knowledge transfer (sum of success rates) curves across four benchmarks. We compare Pi0 trained from scratch (orange), Pi0 trained from PaliGemma (green), and Pi0 pretrained (blue) under different replay buffer sizes ($10$, $100$, $1000$).
...and 9 more figures
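
The captions for Figures 1 and 2 refer to a success-rate matrix and to Negative Backward Transfer (NBT) without spelling out how the two relate. The sketch below is one way to compute NBT from such a matrix. It assumes the definition commonly used with the LIBERO benchmark: for each task, average the drop from its success rate right after it was learned to its success rate at every later checkpoint, then average over tasks. The function name, the 0-indexed layout, and the choice to average over all tasks except the last are assumptions made for illustration; the paper's exact normalization may differ.

```python
def negative_backward_transfer(success):
    """Sketch of NBT from a success-rate matrix (assumed LIBERO-style definition).

    success[i][j] is the success rate on task j after training on task i,
    both 0-indexed, matching the matrix layout described for Figure 1.
    """
    num_tasks = len(success)
    if num_tasks < 2:
        raise ValueError("NBT needs at least two sequential tasks")

    per_task = []
    # The last task has no later checkpoints, so it contributes no forgetting term.
    for k in range(num_tasks - 1):
        just_learned = success[k][k]
        drops = [just_learned - success[t][k] for t in range(k + 1, num_tasks)]
        per_task.append(sum(drops) / len(drops))

    # Assumption: average over the tasks that have at least one later checkpoint.
    return sum(per_task) / len(per_task)


# Hypothetical 3-task matrix, not taken from the paper.
example = [
    [1.0, 0.0, 0.0],
    [0.8, 0.9, 0.0],
    [0.6, 0.9, 1.0],
]
nbt = negative_backward_transfer(example)
```

In this made-up example the first task drops by 0.2 and then 0.4 (mean 0.3) and the second task does not drop at all, so the result is approximately 0.15. A matrix whose columns stay constant below the diagonal gives an NBT of zero, which corresponds to the "no forgetting" case described in the Figure 2 caption.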
