A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping

Nico Sutter, Valentin N. Hartmann, Stelian Coros

TL;DR

This work investigates how different perception encoders, operating on either 2D RGB/D images or 3D voxel grids, affect real-world RL for vacuum-gripper grasping. Building on the SERL framework, the study shows that spatial voxel-based representations outperform purely visual inputs in both seen and unseen scenarios, and that they generalize well when pre-trained VoxNet backbones are combined with observation-space symmetries. The results highlight the importance of 3D spatial perception for reliable, reactive manipulation in unstructured environments, and point to future work on alternative 3D architectures and a broader set of tasks. The findings are relevant to improving throughput and robustness in industrial robotic grasping with suction devices.

Abstract

When manipulating objects in the real world, we need reactive feedback policies that take sensor information into account to inform decisions. This study aims to determine how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation focuses on comparing real-world vision with 3D scene inputs, exploring new architectures in the process. We build on the SERL framework, which provides a sample-efficient and stable RL foundation while keeping training times minimal. The results of this study indicate that policies using spatial information significantly outperform their visual counterparts on a box-picking task with a vacuum gripper. The code and videos of the evaluations are available at https://github.com/nisutte/voxel-serl.
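
To make the notion of a "3D scene input" concrete, the following is a minimal sketch of how a point cloud could be turned into a binary occupancy voxel grid. It is an illustration only, not the authors' pipeline: the function name `voxelize`, its parameters, and the grid dimensions are all assumptions introduced here.

```python
import numpy as np


def voxelize(points, grid_min, voxel_size, grid_shape):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Hypothetical helper for illustration: `grid_min` is the corner of the
    grid with the smallest coordinates, `voxel_size` is the edge length of
    one voxel, and `grid_shape` is the number of voxels along x, y, z.
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)

    # Map each point to the integer index of the voxel that contains it.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(np.int64)

    # Discard points that fall outside the grid bounds.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)

    grid = np.zeros(grid_shape, dtype=np.uint8)
    ix, iy, iz = idx[inside].T
    grid[ix, iy, iz] = 1  # mark every voxel that holds at least one point
    return grid
```

A grid of this kind is the sort of input a 3D convolutional encoder such as VoxNet consumes; the resolution and extent would have to be chosen to cover the local surroundings of the gripper.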

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (left) Voxel grid representation of the suction gripper and the box that is used as spatial observation for the learned policy. (right) Robot arm with vacuum gripper grasping a box.
  • Figure 2: Illustration of the Behavior Tree for grasping a box.
  • Figure 3: Observation space examples at 3 time-steps during a trajectory.
  • Figure 4: Observation (obs) and action rotation behavior during training and inference.
  • Figure 5: Experiment setup for training (left) and testing (right).
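
The observation and action rotation mentioned in the Figure 4 caption can be illustrated with a small sketch. The page does not describe how the symmetry is implemented, so the following is one plausible reading, stated under explicit assumptions: the voxel grid is square in its first two axes, those axes are aligned with the x and y directions of the action, the first two action components are a planar translation, and rotations are restricted to multiples of 90 degrees about the vertical axis. The function name `rotate_obs_action` is hypothetical.

```python
import numpy as np

# Exact cosine and sine values for rotations of 0, 90, 180 and 270 degrees.
_COS = (1, 0, -1, 0)
_SIN = (0, 1, 0, -1)


def rotate_obs_action(voxels, action, k):
    """Rotate a voxel observation and its action by k * 90 degrees about z.

    Assumptions (not taken from the paper): `voxels` has shape (n, n, d)
    with axes (x, y, z), and `action[0:2]` is a planar translation
    (dx, dy). Any remaining action components are left unchanged.
    """
    k = k % 4

    # np.rot90 rotates from the first given axis towards the second,
    # i.e. from x towards y, which is counter-clockwise about z.
    rotated_voxels = np.rot90(voxels, k=k, axes=(0, 1)).copy()

    # Apply the same planar rotation to the translation part of the action.
    c, s = _COS[k], _SIN[k]
    rotated_action = np.array(action, dtype=np.float64)
    dx, dy = rotated_action[0], rotated_action[1]
    rotated_action[0] = c * dx - s * dy
    rotated_action[1] = s * dx + c * dy
    return rotated_voxels, rotated_action
```

Applying the same rotation to both the observation and the action keeps each transition consistent, so a rotated copy could serve as an additional training sample, and the inverse rotation could map a predicted action back to the robot frame at inference time.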