Offline-to-online Reinforcement Learning for Image-based Grasping with Scarce Demonstrations

Bryan Chan, Anson Leung, James Bergstra

TL;DR

The paper tackles sample-efficient image-based robotic manipulation when demonstrations are scarce, a setting where behavioural cloning suffers from distribution shift. It introduces Simplified Q, an offline-to-online reinforcement learning algorithm that replaces the traditional target network with an NTK-inspired regularizer to decorrelate latent representations and stabilize Q-learning, built on Conservative Q-learning with $N$-step backups and symmetric sampling. Empirically, the method is validated on a real UR10e grasping task, achieving above $90\%$ success within roughly two hours of interaction using only 50 demonstrations, and outperforming BC and several RL baselines in online learning. The results indicate that pretrained vision backbones are not strictly necessary for efficient O2O RL in this setting, and they highlight the NTK regularizer as a practical mechanism to curb Q-divergence, enabling safer and more data-efficient learning for real-world robotics.
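
Two of the ingredients named above, $N$-step backups and symmetric sampling, are simple enough to sketch. The Python below is a minimal illustration and not the authors' implementation: the function names, the transition representation, and the even offline/online split are assumptions (the even split follows the usual meaning of symmetric sampling in prior offline-to-online work).

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def n_step_target(rewards: Sequence[float], dones: Sequence[bool],
                  bootstrap_q: float, gamma: float) -> float:
    """N-step backup: discounted sum of N rewards plus a bootstrapped Q-value.

    `bootstrap_q` stands for the critic's estimate at the state reached after
    the N steps (hypothetical input; how it is computed is not shown here).
    """
    target, discount = 0.0, 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # Episode terminated inside the window: do not bootstrap.
            return target
    return target + discount * bootstrap_q


def symmetric_batch(offline: Sequence[T], online: Sequence[T],
                    batch_size: int, rng: random.Random) -> List[T]:
    """Draw half of the batch from demonstrations and half from online data."""
    if not online:
        # Before any interaction, fall back to demonstrations only (assumption).
        return [rng.choice(offline) for _ in range(batch_size)]
    half = batch_size // 2
    batch = [rng.choice(offline) for _ in range(half)]
    batch += [rng.choice(online) for _ in range(batch_size - half)]
    return batch
```

For example, `n_step_target([1.0, 1.0, 1.0], [False, False, False], bootstrap_q=10.0, gamma=0.5)` evaluates to `1 + 0.5 + 0.25 + 0.125 * 10 = 3.0`. Sampling half of every batch from the 50 demonstrations keeps them influential even after the online buffer grows much larger.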

Abstract

Offline-to-online reinforcement learning (O2O RL) aims to obtain a continually improving policy as it interacts with the environment, while ensuring the initial policy behaviour is satisficing. This satisficing behaviour is necessary for robotic manipulation, where random exploration can be costly due to catastrophic failures and time. O2O RL is especially compelling when we can only obtain a scarce amount of (potentially suboptimal) demonstrations, a scenario where behavioural cloning (BC) is known to suffer from distribution shift. Previous works have outlined the challenges in applying O2O RL algorithms in image-based environments. In this work, we propose a novel O2O RL algorithm that can learn in a real-life image-based robotic vacuum grasping task with a small number of demonstrations, where BC fails the majority of the time. The proposed algorithm replaces the target network in off-policy actor-critic algorithms with a regularization technique inspired by the neural tangent kernel. We demonstrate that the proposed algorithm can reach above 90% success rate in under two hours of interaction time, with only 50 human demonstrations, while BC and existing commonly used RL algorithms fail to achieve similar performance.

Paper Structure

This paper contains 23 sections, 7 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: (Left) Image-based grasping environment setup. The agent is required to control the UR10e arm with vacuum suction to grasp the orange rice bag inside the bin and lift it well above the bin. (Middle) Comparison between BC and offline RL trained with Simplified Q (Ours). Simplified Q grasps with limited success, while BC performs marginally better. (Right) The impact of offline dataset size on BC. BC achieves only around $35\%$ success rate until image augmentation [yarats2021image] is added. Success rates are reported for the various offline-trained policies, each evaluated on 50 grasp attempts.
  • Figure 2: Aggregated success rate (Top) and P-stop rate (Bottom) across three seeds with 95% confidence intervals (CIs) [Rishabh2021iqm]. Simplified Q (Ours) generally performs better than DR3 in both success rate and P-stop rate. Furthermore, DR3 has significantly wider CIs than Simplified Q on both metrics.
  • Figure 3: Success rate (Left) and P-stop rate (Right) across three seeds, averaged over every 10 episodes. Results are shown as the interquartile mean, and shaded regions show 95% stratified bootstrap confidence intervals (CIs) [Rishabh2021iqm]. Simplified Q consistently achieves a higher success rate and a lower P-stop rate as the amount of online interaction increases. While DR3 can achieve reasonable success rates, its CI is significantly wider than that of Simplified Q.
  • Figure 4: The frequency of actions taken by the policy during training. We compare Simplified Q (Ours), CrossQ, and SAC. Our policy appears to be able to perform fine-grained actions on the $xy$ axes, while CrossQ and SAC exhibit bang-bang behaviour. SAC further appears to have converged to moving in a single direction.
  • Figure 5: (Left) The probability of Simplified Q (Ours) being better than existing RL algorithms in success rate, run over three seeds. (Middle) Comparison between image encoders: (1) trained end-to-end (E2E), (2) randomly initialized image encoder, and (3) pretrained image encoder with the HILP objective [park2024foundation]. The model with a frozen randomly initialized encoder fails to improve its grasp success rate even after online interactions, while the models trained end-to-end and with a frozen pretrained image encoder continue to improve as they gather more data. (Right) Asymptotic performance of end-to-end training versus a frozen pretrained image encoder. The model trained E2E appears to achieve better asymptotic performance than the one with a frozen pretrained image encoder. All models are first pretrained for 100K gradient steps with 50 human-teleoperated demonstrations.
  • ...and 7 more figures
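
The Figure 3 caption reports an interquartile mean with bootstrap confidence intervals. As a rough illustration of those two statistics, the sketch below computes an interquartile mean and a plain percentile-bootstrap interval over a single list of per-run scores. It is a simplification under stated assumptions: the caption refers to a stratified bootstrap, which this code does not implement, and the function names and defaults are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def interquartile_mean(values: Sequence[float]) -> float:
    """Mean of the values left after dropping the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of items trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap interval for the interquartile mean.

    Not stratified: all scores are resampled from one pool (assumption).
    """
    rng = random.Random(seed)
    pool = list(values)
    # Recompute the statistic on resamples drawn with replacement.
    stats = sorted(
        interquartile_mean(rng.choices(pool, k=len(pool)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]
```

Because the top and bottom quarters are discarded, a single outlying run barely moves the estimate: `interquartile_mean([1, 1, 1, 1, 1, 1, 1, 100])` is `1`.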