General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

Fahim Shahriar, Cheryl Wang, Alireza Azimi, Gautham Vasan, Hany Hamed Elanwar, A. Rupam Mahmood, Colin Bellinger

TL;DR

This paper addresses the challenge of goal representation in visual GCRL by introducing object-agnostic, mask-based goal conditioning and a mask-derived dense reward. The method appends a dynamic binary mask to visual observations and stacks three frames to provide progression cues, enabling fast learning and strong generalization to unseen targets. It evaluates SAC and PPO across three robotics environments, demonstrating 99.9% reaching accuracy on both training and unseen objects, and shows that mask-based rewards can rival distance-based rewards with improved stability. The work also demonstrates sim-to-real transfer and real-world learning from scratch using open-vocabulary detectors (Detic, Grounding DINO), highlighting practical potential for vision-driven robotic manipulation without privileged information.

Abstract

Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground-truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be used to perform pick-up tasks with high accuracy, without using any positional information about the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer using two different physical robots, with pretrained open-vocabulary object detection models for mask generation.
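
As a rough illustration of the goal representation described above, the following sketch appends a binary target mask to each RGB frame as an extra channel and stacks the three most recent frames into one observation. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `append_mask`, `MaskedFrameStack`, and `mask_coverage_reward`, the channel-last layout, and the scaling to [0, 1] are all hypothetical. In particular, `mask_coverage_reward` is only one plausible mask-derived dense signal (the fraction of pixels the target occupies); the reward actually used in the paper is not specified in this summary.

```python
from collections import deque

import numpy as np


def append_mask(rgb, mask):
    """Return an (H, W, 4) float32 frame: RGB scaled to [0, 1] plus a 0/1 mask channel."""
    if rgb.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    rgb_scaled = rgb.astype(np.float32) / 255.0
    mask_channel = (mask > 0).astype(np.float32)[..., None]
    return np.concatenate([rgb_scaled, mask_channel], axis=-1)


class MaskedFrameStack:
    """Keeps the most recent masked frames and concatenates them along channels."""

    def __init__(self, num_frames=3):
        self.frames = deque(maxlen=num_frames)

    def reset(self, rgb, mask):
        # Fill the buffer with copies of the first frame at episode start.
        frame = append_mask(rgb, mask)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def step(self, rgb, mask):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(append_mask(rgb, mask))
        return self.observation()

    def observation(self):
        # Oldest frame first; shape is (H, W, 4 * num_frames).
        return np.concatenate(list(self.frames), axis=-1)


def mask_coverage_reward(mask):
    """Hypothetical dense reward: fraction of image pixels covered by the target mask."""
    return float((mask > 0).mean())
```

Because the mask marks only where the target is and not what it is, the same observation layout applies to any object a detector can segment, which is the sense in which the representation is object-agnostic.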

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Simulated and real-world environments.
  • Figure 2: Simulated experiment results.
  • Figure 3: (a) Trial results for Experiment 3, (b) Learning curves for the real-world learning from scratch task.
  • Figure 4: A trained mask-based GC and reward agent successfully picks up the apple object. The green rectangles in the masks (bottom images) represent the ROI (shown only for illustration). In the last two frames, the agent lifts up the apple object after a successful grasp.
  • Figure 5: Sim-to-real using UR10e and real-world learning from scratch using Franka.