Grasp, See, and Place: Efficient Unknown Object Rearrangement with Policy Structure Prior
Kechun Xu, Zhongxiang Zhou, Jun Wu, Haojian Lu, Rong Xiong, Yue Wang
TL;DR
This work tackles unknown object rearrangement under perception noise by introducing GSP, a dual-loop policy: an inner See loop for self-confident in-hand object matching and an outer loop for Grasp and Place planning. The design rests on a structure prior, namely that perception noise affects grasp and place in a decoupled way. This decoupling allows in-hand matching to be improved separately through an active See policy, while the outer loop is trained with task-level RL and CLIP-based object matching. GSP achieves higher task completion rates and fewer steps than strong baselines in both simulation and real-world tests, and it generalizes to unseen objects and diverse noise conditions. CLIP is used for zero-shot matching and self-termination, and the planner uses buffers to resolve circular dependencies, which makes the system practical for unknown-object rearrangement in cluttered environments.
Abstract
We focus on the task of unknown object rearrangement, where a robot is supposed to re-configure the objects into a desired goal configuration specified by an RGB-D image. Recent works explore unknown object rearrangement systems by incorporating learning-based perception modules. However, they are sensitive to perception error and pay little attention to task-level performance. In this paper, we aim to develop an effective system for unknown object rearrangement amidst perception noise. We theoretically reveal that noisy perception impacts grasp and place in a decoupled way, and show that such a decoupled structure is valuable for improving task optimality. We propose GSP, a dual-loop system with the decoupled structure as prior. For the inner loop, we learn a see policy for self-confident in-hand object matching. For the outer loop, we learn a grasp policy aware of object matching and grasp capability, guided by task-level rewards. We leverage the foundation model CLIP for object matching, policy learning and self-termination. A series of experiments indicate that GSP can conduct unknown object rearrangement with higher completion rates and fewer steps.
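
To make the inner loop concrete, the following is a minimal Python sketch of confidence-gated in-hand matching, not the authors' implementation. It assumes the in-hand object and each goal object are represented by feature vectors (plain lists here, standing in for CLIP image features), that matching confidence is a softmax over cosine similarities, and that the loop requests another view until the confidence passes a threshold. The names `cosine`, `match_confidence`, `see_loop`, `observe`, and the values of `temperature`, `threshold`, and `max_views` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_confidence(
    in_hand: Vector, goals: List[Vector], temperature: float = 0.1
) -> Tuple[int, float]:
    """Return (index of best-matching goal, softmax probability of that match).

    The softmax-over-similarities confidence is an assumption of this sketch.
    """
    sims = [cosine(in_hand, g) for g in goals]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def see_loop(
    observe: Callable[[int], Vector],
    goals: List[Vector],
    threshold: float = 0.8,
    max_views: int = 5,
) -> Tuple[int, float, int]:
    """Hypothetical inner loop: re-observe the in-hand object until the match
    is confident enough or the view budget is spent.

    `observe(k)` stands in for moving the object to the k-th view and
    extracting its features. Returns (goal index, confidence, views used).
    """
    best_idx, best_conf, views_used = -1, 0.0, 0
    for view in range(max_views):
        views_used = view + 1
        idx, conf = match_confidence(observe(view), goals)
        if conf > best_conf:
            best_idx, best_conf = idx, conf
        if best_conf >= threshold:
            break  # confident match: hand the result to the outer loop
    return best_idx, best_conf, views_used
```

In this sketch the outer Grasp and Place loop would call `see_loop` once per grasped object and use the returned goal index to decide where to place it. Keeping the best match across views is a design choice of the sketch; a learned see policy would instead choose which view to request next.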
