State Representations as Incentives for Reinforcement Learning Agents: A Sim2Real Analysis on Robotic Grasping

Panagiotis Petropoulakis, Ludwig Gräf, Mohammadhossein Malmir, Josip Josifovski, Alois Knoll

TL;DR

The paper investigates how different state representations influence reinforcement learning for robotic grasping and sim2real transfer. By constructing a continuum of representations from hand-crafted numerical states to image-based latent features and applying domain randomization, it compares performance across a model-based ideal baseline, numerical agents, and various vision-based approaches, including an Incentivized Grasping AutoEncoder (IGAE). Key findings show that task-specific, incentive-aligned representations accelerate learning and improve sim2real robustness, with IGAE delivering the best vision-based sim2real performance (84% in real-world tests), while numerical states can match non-learning baselines. The work highlights the value of decoupling representation learning from policy learning and provides guidance for selecting state representations in robotics tasks that require precise control and transfer between simulation and real hardware.

Abstract

Choosing an appropriate representation of the environment for the underlying decision-making process of the reinforcement learning agent is not always straightforward. The state representation should be inclusive enough to allow the agent to informatively decide on its actions and disentangled enough to simplify policy training and the corresponding sim2real transfer. Given this outlook, this work examines the effect of various representations in incentivizing the agent to solve a specific robotic task: antipodal and planar object grasping. A continuum of state representations is defined, starting from hand-crafted numerical states to encoded image-based representations, with decreasing levels of induced task-specific knowledge. The effects of each representation on the ability of the agent to solve the task in simulation and the transferability of the learned policy to the real robot are examined and compared against a model-based approach with complete system knowledge. The results show that reinforcement learning agents using numerical states can perform on par with non-learning baselines. Furthermore, we find that agents using image-based representations from pre-trained environment embedding vectors perform better than end-to-end trained agents, and hypothesize that separation of representation learning from reinforcement learning can benefit sim2real transfer. Finally, we conclude that incentivizing the state representation with task-specific knowledge facilitates faster convergence for agent training and increases success rates in sim2real robot control.

Paper Structure (13 sections, 4 equations, 5 figures, 4 tables)

This paper contains 13 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The same backbone architecture was consistently used in all the vision-based agents. This decision was made to assess the impact of various training objectives in shaping the original image space into meaningful representations of the environment. The architecture itself is an adaptation of the ResNet [wickramasinghe2021resnet] with a spatial SoftMax output layer [levine2016end, Russ2019].
  • Figure 2: State Representations Continuum: A model-based baseline with full system knowledge and multiple approaches having different state representation spaces with decreasing levels of system knowledge (increasing abstraction) are examined. Models on the left have the state information explicitly available in the form of numerical values for the robot joints or the object, whereas vision-centric techniques on the right generate implicit latent representations to guide the policy.
  • Figure 3: The Incentivized Grasping AutoEncoder (IGAE) takes an augmented image as input and performs reconstruction tasks. It first reconstructs the original (denoised) RGB image, then the gripper binary mask, and finally the object binary mask. The total loss is a weighted sum of the individual reconstruction losses, with weights $\lambda_i$. In our settings, $\lambda_1$ is set to 1, $\lambda_2$ to 10, and $\lambda_3$ to 20.
  • Figure 4: Overview of the real-world setup (left). Image observation from the Kinect V2 camera (top), and the aligned image observation in the simulated environment (bottom).
  • Figure 5: The development of the success rate during training of the proposed agents, each utilizing a distinct representation space.
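
The weighted reconstruction objective described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption gives only the weights ($\lambda_1 = 1$, $\lambda_2 = 10$, $\lambda_3 = 20$) and the order of the three reconstruction targets. The choice of mean squared error for the RGB image and binary cross-entropy for the two masks, as well as all function and variable names, are assumptions made here for illustration.

```python
import numpy as np

# Weights taken from the Figure 3 caption (RGB, gripper mask, object mask).
LAMBDA_RGB, LAMBDA_GRIPPER, LAMBDA_OBJECT = 1.0, 10.0, 20.0


def mse(pred, target):
    """Mean squared error (assumed loss for the RGB reconstruction)."""
    return float(np.mean((pred - target) ** 2))


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy (assumed loss for the binary masks)."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def igae_total_loss(rgb_pred, rgb_target,
                    gripper_pred, gripper_mask,
                    object_pred, object_mask):
    """Weighted sum of the three reconstruction losses (hypothetical helper)."""
    loss_rgb = mse(rgb_pred, rgb_target)
    loss_gripper = bce(gripper_pred, gripper_mask)
    loss_object = bce(object_pred, object_mask)
    return (LAMBDA_RGB * loss_rgb
            + LAMBDA_GRIPPER * loss_gripper
            + LAMBDA_OBJECT * loss_object)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.random((64, 64, 3))                       # stand-in for the clean RGB target
    gripper = (rng.random((64, 64)) > 0.9).astype(float)  # stand-in binary masks
    obj = (rng.random((64, 64)) > 0.8).astype(float)
    uninformed = np.full((64, 64), 0.5)                 # mask prediction with no information
    print(igae_total_loss(rgb, rgb, uninformed, gripper, uninformed, obj))
```

With these weights, an error in the object mask contributes twenty times as much to the total as an equal RGB error, which is one way to read the "incentive" toward task-relevant image regions.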