Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, Dongbin Zhao

TL;DR

This work addresses two weaknesses of supervised fine-tuning (SFT) for Vision-Language-Action (VLA) models: its heavy demand for task-specific data and its tendency toward catastrophic forgetting. It proposes LifeLong-RFT, a post-training strategy that combines chunking-level on-policy reinforcement learning with a Multi-Dimensional Process Reward (MDPR). MDPR decomposes feedback into three components: the Quantized Action Consistency Reward (QACR), the Continuous Trajectory Alignment Reward (CTAR), and the Format Compliance Reward (FCR). Together they allow the policy to be refined without online environmental feedback or a pre-trained reward model. Experiments on SimplerEnv, LIBERO, and real-world tasks show consistent improvements over SFT baselines in multi-task and continual-learning settings, including a 22% gain in average success rate over SFT for continual learning on LIBERO and effective adaptation to new tasks using only 20% of the training data. The method offers a practical path toward long-lived robots, though extending it to continuous-action policies is noted as future work.

Abstract

Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed Multi-Dimensional Process Reward (MDPR) mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs.
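
To make the three reward dimensions concrete, the following is a minimal sketch of how a per-chunk reward of this kind could be assembled. It is not the paper's implementation: the abstract does not give the formulas, so the token-match fraction used for QACR, the exponential of the mean absolute error used for CTAR, the length-and-range check used for FCR, the uniform-bin `decode_tokens` helper, and all parameter names and default weights are assumptions chosen only for illustration.

```python
import math
from typing import Sequence, Tuple


def decode_tokens(tokens: Sequence[int], vocab_size: int,
                  low: float = -1.0, high: float = 1.0) -> list:
    """Hypothetical detokenizer: map each bin index to the center of a uniform bin."""
    width = (high - low) / vocab_size
    return [low + (t + 0.5) * width for t in tokens]


def qacr(pred_tokens: Sequence[int], ref_tokens: Sequence[int]) -> float:
    """Assumed QACR: fraction of positions whose discrete action token matches."""
    if not ref_tokens:
        return 0.0
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / max(len(pred_tokens), len(ref_tokens))


def ctar(pred_actions: Sequence[float], ref_actions: Sequence[float],
         scale: float = 1.0) -> float:
    """Assumed CTAR: exp(-MAE / scale) between decoded and reference actions."""
    if not ref_actions or len(pred_actions) != len(ref_actions):
        return 0.0
    mae = sum(abs(p - r) for p, r in zip(pred_actions, ref_actions)) / len(ref_actions)
    return math.exp(-mae / scale)


def fcr(pred_tokens: Sequence[int], expected_len: int, vocab_size: int) -> float:
    """Assumed FCR: 1.0 if the chunk has the right length and only valid tokens."""
    valid = len(pred_tokens) == expected_len and all(
        0 <= t < vocab_size for t in pred_tokens
    )
    return 1.0 if valid else 0.0


def mdpr(pred_tokens: Sequence[int], ref_tokens: Sequence[int],
         ref_actions: Sequence[float], vocab_size: int,
         weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three reward terms for one action chunk."""
    w_q, w_c, w_f = weights
    pred_actions = decode_tokens(pred_tokens, vocab_size)
    return (
        w_q * qacr(pred_tokens, ref_tokens)
        + w_c * ctar(pred_actions, ref_actions)
        + w_f * fcr(pred_tokens, len(ref_tokens), vocab_size)
    )


# Reference chunk taken from a demonstration (illustrative values).
VOCAB = 256
ref_tokens = [0, 128, 255]
ref_actions = decode_tokens(ref_tokens, VOCAB)

perfect = mdpr(ref_tokens, ref_tokens, ref_actions, VOCAB)
truncated = mdpr([0, 128], ref_tokens, ref_actions, VOCAB)
```

Because every term is computed against a reference chunk from demonstration data, a reward of this shape needs neither a simulator rollout nor a learned reward model, which matches the environment-free setting the abstract describes. The weights are exposed because the paper reports an ablation over reward combination weights; the values used here are placeholders.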


Paper Structure

This paper contains 43 sections, 6 equations, 9 figures, 13 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Overview of the proposed LifeLong-RFT. This strategy integrates the chunking-level on-policy reinforcement learning algorithm with the Multi-Dimensional Process Reward mechanism to facilitate policy optimization.
  • Figure 2: Overview of real-world experimental tasks: Pick & Place (Banana, Bread), Pull Drawer, and Hang Chinese Knot.
  • Figure 3: Adaptation efficiency on representative new tasks.
  • Figure 4: Ablation study on the reward combination weights.
  • Figure 5: Representative reward curves during the training phase. The visualizations illustrate the training evolution of (a) MDPR, (b) QACR, and (c) CTAR.
  • ...and 4 more figures