Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, Aviral Kumar

TL;DR

This work introduces Policy-Agnostic RL (PA-RL), a universal actor-critic framework that enables offline RL and online fine-tuning across diverse policy classes and backbones, including diffusion and autoregressive transformer policies. By decoupling policy improvement from policy parameter updates through a two-stage process—global action re-ranking plus local action optimization, followed by distillation of optimized actions via supervised learning—PA-RL achieves state-of-the-art performance in simulated benchmarks and real-world robotics. It demonstrates robust improvements in both offline-to-online settings and pure online fine-tuning, including successful autonomous fine-tuning of a 7B OpenVLA policy in real-world manipulation tasks. The approach broadens the practical applicability of RL by allowing a single method to train multiple policy architectures, significantly reducing the need for policy-class-specific algorithm design.

Abstract

Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build off the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied on "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, and run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves the performance and sample-efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
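The action optimization procedure described in the abstract (sample several actions, re-rank them with the critic, then take gradient steps on the survivors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_q`, `toy_q_grad`, `optimize_actions`, and the values of `top_k`, `grad_steps`, and `step_size` are all hypothetical, the critic is a toy quadratic with an analytic gradient, and Gaussian noise stands in for samples from a base policy.

```python
import numpy as np

def toy_q(state, actions):
    """Hypothetical critic (assumption): peaks when the action equals 0.5 * state."""
    return -np.sum((actions - 0.5 * state) ** 2, axis=-1)

def toy_q_grad(state, actions):
    """Analytic gradient of toy_q with respect to the actions."""
    return -2.0 * (actions - 0.5 * state)

def optimize_actions(state, base_samples, top_k=4, grad_steps=10, step_size=0.1):
    # Global optimization: re-rank sampled actions by critic value, keep the top_k.
    scores = toy_q(state, base_samples)
    keep = np.argsort(scores)[::-1][:top_k]
    candidates = base_samples[keep].copy()
    # Local optimization: gradient ascent on the critic with respect to each action.
    for _ in range(grad_steps):
        candidates = candidates + step_size * toy_q_grad(state, candidates)
    # Pick the highest-scoring optimized action as the supervised learning target
    # (a simplification chosen for this sketch).
    final_scores = toy_q(state, candidates)
    return candidates[np.argmax(final_scores)]

rng = np.random.default_rng(0)
state = np.array([1.0, -2.0])
base_samples = rng.normal(size=(32, 2))  # stand-in for samples from a base policy
target = optimize_actions(state, base_samples)
```

The returned `target` would then be paired with `state` in an ordinary supervised loss for whichever policy class is being trained (for example, a denoising loss for a diffusion policy or cross-entropy over action tokens for an autoregressive policy). Because the critic only ever touches actions, no critic gradient needs to pass through the policy parameters.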

Paper Structure

This paper contains 28 sections, 9 equations, 18 figures, 6 tables, 2 algorithms.

Figures (18)

  • Figure 1: Policy-agnostic reinforcement learning (PA-RL) is a simple approach for training any policy class and backbone via actor-critic RL in both the offline RL and online RL fine-tuning settings. This enables us to benefit from the expressive power of different policy classes and from pre-training priors. Our results show that PA-RL is the first method to effectively improve diffusion policies and large generalist pre-trained policies in real-world robotic manipulation tasks. After pre-training with a few task demonstrations or zero-shot language-conditioned trials, it can significantly improve the performance of a base policy in as little as 40 minutes. On simulated benchmarks, we find substantially better results when using PA-RL with diffusion policies, where it sets a new state-of-the-art in both offline RL and online fine-tuning, as well as with autoregressive policies.
  • Figure 2: An overview of PA-RL. Instead of directly passing critic gradients through the policy parameters, PA-RL first "optimizes" actions via critic re-ranking and gradient ascent. Then, it trains the policy to mimic the most optimized action.
  • Figure 3: Learning curves of online fine-tuning with various methods. Observe that PA-RL + Cal-QL (red) almost always outperforms or matches the next best method. Other methods for fine-tuning diffusion policies (IDQL, DQL, DPPO) are less stable and perform substantially worse. Since DPPO is substantially less data-efficient, we plot it with different x-axis units: for kitchen each unit is 500 episodes (axis goes from 0 to 500k), for antmaze each unit is 100 episodes (axis goes from 0 to 100k), and for calvin each unit is 10 episodes (axis goes from 0 to 10k).
  • Figure 4: Evolution of learned behaviors during online fine-tuning of diffusion policies with PA-RL on task (a), with a new initial location for the cup. The offline initialization (in red) fails to both grasp the cup and place it on the rack. During intermediate online interaction episodes (in yellow), it successfully grasps the cup, but fails to place it on the rack. After 50 episodes (in green), it learns to successfully grasp the cup and place it on the rack.
  • Figure 5: Comparison with CEM optimizer. Instead of using the action optimization procedure detailed in the method section, any time the Cal-QL algorithm queries the policy we perform a Cross-Entropy Method (CEM) optimization process to obtain actions. We use the same CEM hyper-parameters as simmons2019q, and keep the same Cal-QL hyper-parameters and architectures as in PA-RL. For all tested environments, the performance after pre-training (i.e., at step 0, before taking any online steps) is at or close to 0. Performance improves over the course of fine-tuning, but remains well below PA-RL with a diffusion policy.
  • ...and 13 more figures
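
For readers unfamiliar with the Cross-Entropy Method baseline mentioned in the Figure 5 caption, a generic sketch of CEM action selection against a critic is given below. It is an illustration only: `toy_q` is a hypothetical quadratic critic, `cem_action` is a hypothetical helper, and the values of `iters`, `pop`, and `n_elite` are assumptions chosen for the example, not the hyper-parameters used in the paper or in the cited work.

```python
import numpy as np

def toy_q(state, actions):
    """Hypothetical critic (assumption): peaks when the action equals 0.5 * state."""
    return -np.sum((actions - 0.5 * state) ** 2, axis=-1)

def cem_action(state, q_fn, action_dim, iters=5, pop=64, n_elite=6, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a broad Gaussian over actions; no learned policy is involved.
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    for _ in range(iters):
        # Sample a population of candidate actions from the current Gaussian.
        samples = mean + std * rng.normal(size=(pop, action_dim))
        scores = q_fn(state, samples)
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        elites = samples[np.argsort(scores)[-n_elite:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids a degenerate distribution
    return mean

state = np.array([1.0, -2.0])
action = cem_action(state, toy_q, action_dim=2)
```

Unlike the PA-RL procedure, this search starts from a fixed Gaussian rather than from samples of a trained policy, so its result depends entirely on the critic.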