Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su

TL;DR

This work improves offline-trained large policy models in robotics through online refinement with a model-agnostic residual policy. The residual policy is trained with Soft Actor-Critic (SAC) on top of a frozen base policy, using bounded actions and a controlled exploration schedule to keep learning stable and sample-efficient. On ManiSkill and Adroit, with Behavior Transformer (BeT) and Diffusion Policy as base models, Policy Decorator reaches near-perfect task performance while preserving the smooth motions characteristic of imitation learning, and it outperforms both fine-tuning and non-fine-tuning baselines. The approach is robust across observation modalities and base-policy architectures, and the results highlight how much component design and hyperparameters matter in practical online refinement.

Abstract

Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See our project page (https://policydecorator.github.io) for videos.
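
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of how a residual policy might wrap a frozen base policy, not the authors' implementation. It assumes the residual output is clipped to [-1, 1] and scaled by a small factor `alpha` (one way to obtain "bounded actions"), and that the "controlled exploration" is a linear ramp of the probability of applying the residual over `horizon` steps. The names `decorate`, `residual_probability`, `alpha`, and `horizon` are hypothetical, and the SAC training loop for the residual is omitted.

```python
import random
from typing import Callable, List, Sequence

Action = List[float]
Policy = Callable[[Sequence[float]], Action]


def residual_probability(step: int, horizon: int) -> float:
    """Assumed schedule: ramp linearly from 0 to 1 over `horizon` steps."""
    if horizon <= 0:
        return 1.0
    return min(1.0, max(0.0, step / horizon))


def decorate(
    base_policy: Policy,
    residual_policy: Policy,
    alpha: float,
    horizon: int,
    rng: random.Random,
) -> Callable[[Sequence[float], int], Action]:
    """Wrap a frozen base policy with a bounded, gradually enabled residual."""

    def act(obs: Sequence[float], step: int) -> Action:
        base_action = list(base_policy(obs))  # the base policy is never updated
        # Early in training, mostly fall back to the base action alone.
        if rng.random() >= residual_probability(step, horizon):
            return base_action
        residual = residual_policy(obs)
        # Clip each residual component to [-1, 1], then scale by alpha,
        # so the correction can never exceed alpha per action dimension.
        bounded = [alpha * max(-1.0, min(1.0, r)) for r in residual]
        return [b + d for b, d in zip(base_action, bounded)]

    return act


if __name__ == "__main__":
    # Toy stand-ins for the base and residual policies (illustrative only).
    policy = decorate(
        base_policy=lambda obs: [0.5, -0.2],
        residual_policy=lambda obs: [3.0, -0.5],
        alpha=0.1,
        horizon=100,
        rng=random.Random(0),
    )
    print(policy([0.0], 0))    # residual disabled at step 0
    print(policy([0.0], 100))  # residual always applied once the ramp ends
```

Because the wrapper only calls the base policy and adds a correction to its output, it never needs access to the base model's weights or architecture, which is what makes this kind of refinement model-agnostic.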

Paper Structure

This paper contains 84 sections, 1 equation, 30 figures, 15 tables.

Figures (30)

  • Figure 1: Policy Decorator improves base policy to near-perfect performance on two benchmarks, outperforming fine-tuning and non-fine-tuning baselines.
  • Figure 2: Our framework (Policy Decorator) improves large policy models through online interactions. We learn a residual policy via RL using controlled exploration strategies (see the section on controlled exploration). Once learned, it functions similarly to Python decorators, wrapping the base policy with an additional function to boost performance.
  • Figure 3: Small adjustments can bring deviated trajectories back on track.
  • Figure 4: Progressive Exploration Schedule.
  • Figure 5: Task visualizations. ManiSkill (left four figures) and Adroit (right four figures).
  • ...and 25 more figures