Refined Policy Distillation: From VLA Generalists to RL Experts
Tobias Jülg, Wolfram Burgard, Florian Walter
TL;DR
This work introduces Refined Policy Distillation (RPD), an on-policy RL framework that distills knowledge from Vision-Language-Action Models (VLAs) into compact, high-performing student policies. RPD augments PPO with a mean-squared error term that pulls the student’s actions toward the actions proposed by the VLA teacher. As a result, it converges faster than the RL baseline and can surpass the VLA teacher, even under sparse rewards and viewpoint changes. The approach is validated on ManiSkill3 tasks using fine-tuned Octo and OpenVLA, demonstrating improved sample efficiency, robustness to camera variations, and partial generalization to hold-out tasks. Limitations include dependence on the VLA’s quality and the computational demands of RL in simulation. Future work targets sim-to-real deployment and data-collection strategies to extend these gains to real robots.
Abstract
Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap by combining on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration with the actions of a teacher VLA, which increases sample efficiency and speeds up convergence. To evaluate RPD in simulation, we complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3. Evaluating in simulation is a key requirement for applying RL, and it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also converging faster than the RL baseline. RPD is also robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io
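To make the combination of on-policy RL and behavioral cloning concrete, the following is a minimal sketch of a PPO clipped surrogate loss with an added mean-squared error term toward the teacher's actions. It is an illustration under stated assumptions, not the authors' implementation: the function name `rpd_style_loss`, the weight `distill_coef`, and the choice to compare batched student actions directly against teacher actions are all hypothetical, and the value loss and entropy bonus of a full PPO objective are omitted.

```python
import numpy as np

def rpd_style_loss(ratio, advantages, student_actions, teacher_actions,
                   clip_eps=0.2, distill_coef=1.0):
    """Illustrative PPO policy loss plus an MSE term toward teacher actions.

    ratio:            pi_new(a|s) / pi_old(a|s) per sample, shape (N,)
    advantages:       advantage estimates per sample, shape (N,)
    student_actions:  actions output by the student, shape (N, action_dim)
    teacher_actions:  actions proposed by the VLA teacher, same shape
    distill_coef:     hypothetical weight on the distillation term
    """
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    student_actions = np.asarray(student_actions, dtype=float)
    teacher_actions = np.asarray(teacher_actions, dtype=float)

    # Standard PPO clipped surrogate, negated so that lower is better.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    ppo_loss = -np.mean(np.minimum(unclipped, clipped))

    # Behavioral-cloning style term: mean squared distance to the teacher.
    distill_loss = np.mean((student_actions - teacher_actions) ** 2)

    total = ppo_loss + distill_coef * distill_loss
    return total, ppo_loss, distill_loss


if __name__ == "__main__":
    total, ppo, mse = rpd_style_loss(
        ratio=[1.0, 1.5],
        advantages=[1.0, 1.0],
        student_actions=[[0.0, 0.0], [1.0, 1.0]],
        teacher_actions=[[0.0, 0.0], [0.0, 0.0]],
    )
    print(total, ppo, mse)
```

In this sketch, setting `distill_coef` to zero recovers the plain PPO policy loss, while larger values keep the student closer to the teacher during exploration. How this trade-off is set in RPD is not specified in the text above.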
