Refined Policy Distillation: From VLA Generalists to RL Experts

Tobias Jülg, Wolfram Burgard, Florian Walter

TL;DR

This work introduces Refined Policy Distillation (RPD), an on-policy RL framework that distills knowledge from Vision-Language-Action Models into compact, high-performing student policies. By augmenting PPO with a mean-squared error term that aligns the student’s action with the VLA’s guidance, RPD achieves faster convergence and can surpass the VLA teacher, even under sparse rewards and viewpoint changes. The approach is validated on ManiSkill3 tasks using fine-tuned Octo and OpenVLA, demonstrating improved sample efficiency, robustness to camera variations, and partial generalization to hold-out tasks. Limitations include dependence on the VLA’s quality and the computational demands of RL in simulation, with future work targeting sim-to-real deployment and data-collection strategies to extend these gains to real robots.
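The loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `distill_weight` coefficient, and the choice to apply the mean-squared error to the student's action mean are assumptions, and the value and entropy terms of a full PPO objective are omitted for brevity.

```python
import numpy as np


def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def rpd_style_loss(log_prob_new, log_prob_old, advantages,
                   student_action_mean, teacher_action,
                   distill_weight=1.0, clip_eps=0.2):
    """PPO policy loss plus an MSE term pulling the student toward the teacher.

    `distill_weight` is a hypothetical coefficient; the summary does not
    state how the two terms are balanced.
    """
    policy_loss = ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps)
    # Mean-squared error between student and teacher (VLA) actions.
    distill_loss = np.mean((student_action_mean - teacher_action) ** 2)
    total = policy_loss + distill_weight * distill_loss
    return total, policy_loss, distill_loss
```

In this sketch, the teacher actions would come from querying the VLA on the same observations the student collected during its rollout, so the extra term only adds a supervised signal on top of the usual on-policy update.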

Abstract

Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap by combining on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration using the actions of a teacher VLA, resulting in increased sample efficiency and faster convergence. We complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3 to evaluate RPD in simulation. While evaluation in simulation is a key requirement for applying RL, it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also converging faster than the RL baseline. Our approach is also robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: The architecture of RPD: We distill a VLA teacher into an RL student policy. This has two effects: First, the RL student agent is bootstrapped with guided exploration through the VLA teacher. Second, the RL student policy can interact with the environment and is, thus, able to surpass the VLA's performance, refining the distilled policy.
  • Figure 2: Overview of the eight different tasks from ManiSkill3 that were used to distill expert policies with RPD. See Table 1 for the task names (tasks are depicted in row-major order).
  • Figure 3: Average success rates during training, over five runs with different seeds, for vanilla PPO, PPD, and three RPD variants (RPD-MSE, RPD-L1, and RPD-BC), all of which distill from Octo. Shaded areas indicate standard deviations. We also evaluated RPD-MSE with OpenVLA; due to compute constraints, we could only perform a single training run for it. The performance of the fine-tuned VLAs is indicated by dashed lines.
  • Figure 4: Success and reward curves for RPD with Octo and for vanilla PPO, for both dense and sparse rewards, on the six ManiSkill3 tasks that are part of the fine-tuning dataset. The values are averaged over five runs with different seeds and recorded in an evaluation environment. Standard deviations are indicated by shaded areas. All training runs were performed with the same hyperparameters from the ManiSkill3 PPO baseline. For all tasks, RPD quickly outperforms Octo and converges faster than vanilla PPO. In some cases, it even finds good policies when vanilla PPO fails. This effect is more pronounced for sparse rewards.
  • Figure 5: Success and reward curves for RPD with Octo and for vanilla PPO on tasks that are not part of Octo's fine-tuning dataset. The curves show average results over five training runs with different seeds and are recorded in evaluation environments. The shaded areas indicate standard deviations. Note that the Octo baseline does not represent the correct reward in the dense reward setting, as it solves the tasks without tool use. See the method section for details.
  • ...and 2 more figures
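
The captions above report curves averaged over five seeds with standard deviations as shaded areas. A minimal sketch of that aggregation is shown below; the function name and array layout are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the captions do not say which estimator was used.

```python
import numpy as np


def aggregate_over_seeds(success_per_seed):
    """Return per-evaluation mean and standard deviation across seeds.

    `success_per_seed` is assumed to have shape (num_seeds, num_evals),
    one row of success rates per training run.
    """
    runs = np.asarray(success_per_seed, dtype=float)
    mean = runs.mean(axis=0)
    std = runs.std(axis=0)  # population std; an assumption, see lead-in
    return mean, std
```

The returned arrays could then be drawn as a line with a shaded band from `mean - std` to `mean + std`, for example with Matplotlib's `fill_between`.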