A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping
Nico Sutter, Valentin N. Hartmann, Stelian Coros
TL;DR
This work investigates how different perception encoders (2D RGB/D versus 3D voxel grids) affect real-world RL for vacuum-gripper grasping. Using SERL as a backbone, the study shows that spatial voxel-based representations outperform purely visual inputs in both seen and unseen scenarios, with strong generalization when using pre-trained VoxNet backbones and observation-space symmetries. The results highlight the importance of 3D spatial perception for reliable, reactive manipulation in unstructured environments and point to future directions involving alternative 3D architectures and broader tasks. The findings have practical implications for improving throughput and robustness in industrial robotic grasping with suction devices.
Abstract
When manipulating objects in the real world, we need reactive feedback policies that take sensor information into account when making decisions. This study examines how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation compares real-world vision with 3D scene inputs, exploring new architectures in the process. We build on the SERL framework, which provides a sample-efficient and stable RL foundation while keeping training times minimal. The results indicate that policies using spatial information significantly outperform their visual counterparts on a box-picking task with a vacuum gripper. The code and videos of the evaluations are available at https://github.com/nisutte/voxel-serl.
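To make the notion of a 3D scene input concrete, the following is a minimal sketch of how a point cloud could be turned into a binary occupancy voxel grid before being passed to a 3D encoder. It is an illustration only and is not taken from the linked repository: the function name `voxelize`, its parameters, and the choice of a binary occupancy encoding are all assumptions.

```python
import numpy as np


def voxelize(points, lower, upper, resolution):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    Hypothetical helper for illustration; not the authors' implementation.
    points:     array-like of shape (N, 3), xyz coordinates
    lower:      xyz of the minimum corner of the bounding box
    upper:      xyz of the maximum corner of the bounding box
    resolution: number of voxels along each axis, e.g. (32, 32, 32)
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    lower = np.asarray(lower, dtype=np.float64)
    upper = np.asarray(upper, dtype=np.float64)
    res = np.asarray(resolution, dtype=np.int64)

    grid = np.zeros(tuple(res), dtype=np.uint8)

    # Keep only points inside the half-open box [lower, upper).
    inside = np.all((points >= lower) & (points < upper), axis=1)
    kept = points[inside]
    if kept.shape[0] == 0:
        return grid

    # Map each remaining point to the integer index of its voxel.
    cell_size = (upper - lower) / res
    idx = np.floor((kept - lower) / cell_size).astype(np.int64)
    idx = np.minimum(idx, res - 1)  # guard against floating-point rounding

    # Mark every voxel that contains at least one point as occupied.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid
```

A grid produced this way has a fixed shape regardless of how many points the sensor returns, which is what allows it to be fed to a convolutional 3D encoder. The bounding box would typically be chosen around the gripper's local workspace, but that choice is also an assumption here.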
