Neural Rearrangement Planning for Object Retrieval from Confined Spaces Perceivable by Robot's In-hand RGB-D Sensor
Hanwen Ren, Ahmed H. Qureshi
TL;DR
The paper tackles retrieving a target object $o^*$ from unknown confined spaces using a perception pipeline built on the robot's in-hand RGB-D camera and a neural rearrangement planning framework. It introduces two neural modules: the Object Selection Network (OSNet), which identifies the objects blocking the target, and the Region Proposal Network (RPNet), which proposes relocation regions that keep the robot's path homotopy to the target unblocked. The approach combines active sensing, GraspNet-driven grasp poses, and reachability checks (RRT-Connect) in an iterative rearrangement loop that runs until the target becomes reachable. Empirically, the neural planner achieves the best success rate among the compared methods, is on average two orders of magnitude faster than the best-performing baselines, and demonstrates sim-to-real transfer in cabinet-like real-world scenes.
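The iterative loop described above can be illustrated with a minimal toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the shelf is a small grid, reachability is a straight-line check along the target's column (standing in for RRT-Connect), and the hypothetical functions `select_blocker` and `propose_region` are hand-written heuristics standing in for OSNet and RPNet. Perception and grasp planning are omitted entirely.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (column, depth); depth 0 is the shelf opening


def blockers(scene: Dict[str, Cell], target: str) -> List[str]:
    """Toy reachability check: objects in front of the target in its column."""
    col, depth = scene[target]
    in_front = [
        name for name, (c, d) in scene.items()
        if name != target and c == col and d < depth
    ]
    # Sort so that the object closest to the opening comes first.
    return sorted(in_front, key=lambda name: scene[name][1])


def select_blocker(scene: Dict[str, Cell], target: str) -> Optional[str]:
    """Hypothetical stand-in for OSNet: pick the blocker nearest the opening."""
    found = blockers(scene, target)
    return found[0] if found else None


def propose_region(scene: Dict[str, Cell], target: str,
                   width: int, depth: int) -> Optional[Cell]:
    """Hypothetical stand-in for RPNet: a free cell outside the target's
    column, so the relocated object cannot block the path to the target."""
    target_col = scene[target][0]
    occupied = set(scene.values())
    for d in range(depth - 1, -1, -1):  # prefer cells at the back
        for c in range(width):
            if c != target_col and (c, d) not in occupied:
                return (c, d)
    return None


def retrieve(scene: Dict[str, Cell], target: str, width: int, depth: int,
             max_steps: int = 20) -> List[Tuple[str, Cell]]:
    """Relocate blockers one at a time until the target is reachable.
    Returns the list of (object, new_cell) moves; the input is not mutated."""
    scene = dict(scene)
    moves: List[Tuple[str, Cell]] = []
    for _ in range(max_steps):
        name = select_blocker(scene, target)
        if name is None:
            return moves  # nothing blocks the target any more
        cell = propose_region(scene, target, width, depth)
        if cell is None:
            raise RuntimeError("no free relocation cell")
        scene[name] = cell
        moves.append((name, cell))
    raise RuntimeError("step budget exhausted")
```

Calling `retrieve({"target": (1, 2), "a": (1, 0), "b": (1, 1), "c": (0, 2)}, "target", width=3, depth=3)` moves `a` and then `b` out of the target's column. The structure mirrors the loop in the summary (check reachability, select a blocker, propose a region, relocate, repeat), but the learned modules and the motion planner are replaced by the simple rules stated above.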
Abstract
Rearrangement planning for object retrieval tasks from confined spaces is a challenging problem, primarily due to the lack of open space for robot motion and limited perception. Several traditional methods exist to solve object retrieval tasks, but they require overhead cameras for perception and a time-consuming exhaustive search to find a solution, and they often make unrealistic assumptions, such as the environment containing only identical objects with simple geometry. This paper presents a neural object retrieval framework that efficiently performs rearrangement planning of unknown, arbitrary objects in confined spaces to retrieve the desired object using a given robot grasp. Our method actively senses the environment with the robot's in-hand camera. It then selects and relocates the non-target objects such that they do not block the robot's path homotopy to the target object, which also helps an underlying path planner quickly find robot motion sequences. Furthermore, we demonstrate our framework in challenging scenarios, including real-world cabinet-like environments with arbitrary household objects. The results show that our framework achieves the best performance among all presented methods and is, on average, two orders of magnitude computationally faster than the best-performing baselines.
