VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, Haoang Li

Abstract

Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shift and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework that bridges offline SFT and online RL, combining the efficiency of the former with the robustness of the latter. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL, which induces a mode-covering entropy explosion, or Hard-CE, which causes premature entropy collapse, our bounded, mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
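
To make the objective concrete, the sketch below contrasts the three token-level distillation losses named in the abstract: Reverse-KL (expectation under the student, mode-seeking), Forward-KL (expectation under the teacher, mass-covering), and Hard-CE (here taken to mean cross-entropy against the teacher's per-token argmax). This is a minimal PyTorch sketch under the assumption that both policies expose per-token logits over a shared action-token vocabulary; all function and variable names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F


def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher), averaged over action-token positions.

    Both tensors are (batch, seq_len, vocab) logits evaluated on action tokens
    that the *student* generated (on-policy states). Because the expectation is
    taken under the student, the objective is mode-seeking: the student is not
    penalized for ignoring low-probability teacher modes.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)  # frozen teacher
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1).mean()


def forward_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student): mass-covering; shown only for contrast."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)
    return (log_p_t.exp() * (log_p_t - log_p_s)).sum(dim=-1).mean()


def hard_ce_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against the teacher's per-token argmax (one reading of
    'Hard-CE'): one-hot targets, hence the tendency toward entropy collapse."""
    targets = teacher_logits.detach().argmax(dim=-1)           # (batch, seq_len)
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))
```

In an on-policy setting these losses are evaluated on action tokens that the student itself generated, which is what lets the supervision land on policy-induced states rather than on an offline dataset.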

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of VLA-OPD. Our framework unifies offline SFT and online RL through three phases. Phase 1 (Student Sampling): The student VLA policy interacts with the environment to collect on-policy trajectory rollouts ($O \rightarrow A \rightarrow O$). Phase 2 (Teacher Labeling): For each state visited by the student, a frozen expert teacher provides dense, token-level action labels ($\widehat{A}$) without executing them in the environment. Phase 3 (Student Optimization): The student is optimized against the teacher's distribution via a Reverse-KL objective. Unlike standard Forward-KL (bottom right), which induces mass-covering behavior and entropy explosion, our Reverse-KL formulation (bottom left) promotes mode-seeking behavior, effectively filtering out the teacher's out-of-distribution uncertainty and focusing on high-reward actions. A minimal code sketch of this three-phase loop is given after the figure list.
  • Figure 2: Training Efficiency Comparison. We compare our method with the baseline GRPO across two benchmarks. The red line (Ours (Distill)) demonstrates superior sample efficiency in the early stages, achieving high success rates with significantly fewer steps. The dashed orange line (Ours (Distill + GRPO)) shows that further RL fine-tuning breaks the performance bottleneck, surpassing the baseline's final convergence. Notably, on LIBERO-Long (b), our method reaches a success rate of nearly 80% in just 50 steps, whereas the baseline requires over 150 steps, representing a 3$\times$ speedup.
  • Figure 3: Seen–Unseen Trade-off for Forgetting Analysis. Each point corresponds to a checkpoint during fine-tuning. The x-axis is the success rate on seen (target) tasks, and the y-axis is the success rate on a held-out unseen task. Offline SFT exhibits a strong collapse on unseen tasks as seen-task performance increases, while on-policy methods (RL and our distillation) better preserve unseen-task capability.
  • Figure 4: Ablation study comparing Reverse KL, Forward KL, and Hard CE in an on-policy distillation setting, evaluated on the RoboTwin2.0 Beat Block Hammer task. (a) Forward KL suffers a severe early performance drop, and Hard CE plateaus at a suboptimal level, whereas Reverse KL shows steady and superior improvement. (b) These performance differences correlate directly with entropy extremes: Forward KL induces entropy explosion (mode-covering), while Hard CE causes premature entropy collapse (loss of action diversity). In contrast, Reverse KL maintains a healthy, bounded entropy via mode-seeking, ensuring stable training.
  • Figure 5: Ablation on Group Sampling Size ($G$). Training success rates demonstrate that while a larger group size ($G=8$) yields the smoothest optimization and highest final performance, smaller group sizes ($G=2, 4$) also achieve competitive success rates (over $80\%$) without performance collapse. This highlights a highly favorable trade-off between task performance and computational efficiency.
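
To connect the Figure 1 caption to something executable, the following is a minimal sketch of the three-phase loop (student sampling, teacher labeling, student optimization), reusing the `reverse_kl_loss` helper sketched after the abstract. The `student`, `teacher`, `env`, and `optimizer` objects and their interfaces are hypothetical placeholders, not the authors' training code.

```python
import torch


def on_policy_distillation_step(student, teacher, env, optimizer, horizon=64):
    """One iteration of the three-phase loop from Figure 1 (hypothetical APIs).

    Assumed interfaces (illustrative only):
      env.reset() / env.step(action)    -> observation
      student.act(obs)                  -> (action, action_token_ids)
      student.logits(obs, token_ids)    -> (T, vocab) logits for those tokens
      teacher.logits(obs, token_ids)    -> (T, vocab) frozen-teacher logits
    """
    # Phase 1 (Student Sampling): the student rolls out its own trajectory, so
    # the later supervision lands on policy-induced states, not dataset states.
    obs = env.reset()
    visited = []
    for _ in range(horizon):
        action, token_ids = student.act(obs)
        visited.append((obs, token_ids))
        obs = env.step(action)

    # Phase 2 (Teacher Labeling): the frozen teacher scores every visited state
    # with dense, token-level targets; its actions are never executed.
    # Phase 3 (Student Optimization): reverse-KL update toward the teacher,
    # reusing the reverse_kl_loss helper sketched after the abstract.
    loss = torch.zeros(())
    for obs_t, token_ids in visited:
        s_logits = student.logits(obs_t, token_ids)
        with torch.no_grad():
            t_logits = teacher.logits(obs_t, token_ids)
        loss = loss + reverse_kl_loss(s_logits, t_logits)
    loss = loss / len(visited)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

The key point the loop illustrates is that the teacher never acts in the environment: it only labels states the student actually reached, so the dense supervision is always on-policy with respect to the student.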