Grasp, See, and Place: Efficient Unknown Object Rearrangement with Policy Structure Prior

Kechun Xu, Zhongxiang Zhou, Jun Wu, Haojian Lu, Rong Xiong, Yue Wang

TL;DR

This work tackles unknown object rearrangement under perception noise by introducing GSP, a dual-loop policy with a See inner loop for self-confident in-hand matching and an outer loop for Grasp and Place planning. The method is guided by a structure prior that decouples perception effects on grasp and place, enabling targeted improvement of in-hand matching via an active See policy and task-level RL with CLIP-based object matching. The approach demonstrates higher task completion rates and fewer steps than strong baselines in both simulation and real-world tests, and shows robust generalization to unseen objects and diverse noise conditions. The integration of CLIP for zero-shot matching and self-termination, together with a planner that leverages buffers to resolve circular dependencies, provides a practical, scalable solution for unknown-object rearrangement in cluttered environments.

Abstract

We focus on the task of unknown object rearrangement, where a robot is supposed to re-configure the objects into a desired goal configuration specified by an RGB-D image. Recent works explore unknown object rearrangement systems by incorporating learning-based perception modules. However, they are sensitive to perception error, and pay less attention to task-level performance. In this paper, we aim to develop an effective system for unknown object rearrangement amidst perception noise. We theoretically reveal that the noisy perception impacts grasp and place in a decoupled way, and show such a decoupled structure is valuable to improve task optimality. We propose GSP, a dual-loop system with the decoupled structure as prior. For the inner loop, we learn a see policy for self-confident in-hand object matching. For the outer loop, we learn a grasp policy aware of object matching and grasp capability guided by task-level rewards. We leverage the foundation model CLIP for object matching, policy learning and self-termination. A series of experiments indicate that GSP can conduct unknown object rearrangement with higher completion rates and fewer steps.
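
The dual-loop control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `gsp_episode`, the callables `grasp`, `match`, `rotate`, and `place`, and the limits `max_steps` and `max_views` are all hypothetical. In particular, placing with the latest match once the view budget runs out is an assumption of this sketch.

```python
def gsp_episode(grasp, match, rotate, place, max_steps=20, max_views=4):
    """Run an outer grasp/place loop with an inner see loop.

    All four callables are hypothetical stand-ins:
      grasp()        -> in-hand object, or None when nothing is left to move
      match(obj)     -> (goal_index, confident)
      rotate(obj)    -> the same object seen from a new view
      place(obj, g)  -> puts the object at goal g
    Returns the number of outer-loop steps taken.
    """
    steps = 0
    while steps < max_steps:
        obj = grasp()                      # outer loop: pick an object
        if obj is None:
            break
        steps += 1
        goal_idx, confident = match(obj)
        views = 0
        # Inner loop: keep looking until the match is self-confident.
        while not confident and views < max_views:
            obj = rotate(obj)
            goal_idx, confident = match(obj)
            views += 1
        # Assumption: fall back to the latest match if the budget is spent.
        place(obj, goal_idx)
    return steps


def demo():
    """Toy run: a match becomes confident after two rotations."""
    pending = ["can", "box"]
    placed = []

    def grasp():
        return (pending.pop(0), 0) if pending else None

    def match(obj):
        name, view = obj
        return name, view >= 2

    def rotate(obj):
        return (obj[0], obj[1] + 1)

    def place(obj, goal_idx):
        placed.append((goal_idx, obj[1]))

    return gsp_episode(grasp, match, rotate, place), placed
```

In the toy `demo`, each object needs two extra views before it is placed, so the outer loop takes one step per object while the inner loop absorbs the perception uncertainty.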

Paper Structure (39 sections, 3 theorems, 34 equations, 19 figures, 5 tables)

Key Result

Theorem 1

Given a tabletop object rearrangement problem from the configuration of $M$ objects $\mathcal{O}_c$ to that of $N$ objects $\mathcal{O}_g$, $\pi^0$ is an optimal policy under ideal perception.

Figures (19)

  • Figure 1: Grasp, See, and Place. The robot is given the initial and goal scenes for the task of object rearrangement. To improve task-level performance under perception noise, we first derive the decoupled structure by analysis. Guided by this decoupled prior, we incorporate human behavior and task-level rewards into the general framework of GSP. GSP contains two loops: the inner loop actively sees the grasped object to reach a self-confident match, and the outer loop conducts the grasp and place planning.
  • Figure 2: System Overview. Given the RGB-D images of the current and goal scenes, the grasp policy jointly considers object matching and candidate grasps to determine a selected grasp pose. After picking up an object, object matching is conducted between the grasped object and the goal objects. If the matching is self-confident, the object is rearranged to the planned place pose based on occupancy checking. Otherwise, active perception is triggered to predict the delta orientation of the end effector. Then the robot rotates the in-hand object to a new view until a confident matching is achieved. Overall, our method decomposes the object rearrangement process into two loops: an inner loop for see and an outer loop for grasp and place planning.
  • Figure 3: An example to illustrate circular dependency and buffer. After moving $c_4$ to the buffer (marked with the white box), the circular dependency breaks.
  • Figure 4: Architecture of the inner loop, which consists of four key components: object matching, self-termination, see policy, and reward formulation. The four components form a closed loop for self-confident object matching.
  • Figure 5: State representation of the see policy. The robot is supposed to rotate the tomato soup can for a confident matching to a set of goal objects. We visualize two cases of scene representation including (a) wrong matching and (b) correct matching, using the same flow field color coding as RAFT (Teed and Deng, 2020). For each case, optical flows $f$ are generated between the grasped object padding image $\tilde{o}_h$ and the current-matched object padding image $\tilde{o}_g^{j_h}$ (object flow), as well as the goal image $I_g$ (global flow). Then their delta flow $\Delta f$ is the state representation of the see policy, marked with the average magnitude. Note that the image crop of the grasped object is disturbed by the gripper, thus bringing noise for matching.
  • ...and 14 more figures
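
The buffer idea in Figure 3 can be illustrated with a toy slot model. This is a sketch under stated assumptions rather than the paper's planner: objects occupy named slots, every object has a distinct goal slot, a single free slot named `buffer` is available, and the function `plan_moves` is hypothetical.

```python
def plan_moves(current, goal, buffer_slot="buffer"):
    """Plan pick-and-place moves in a toy slot world.

    current, goal: dicts mapping object name -> slot name. Assumes the goal
    slots are distinct and that buffer_slot is not a goal slot.
    Returns a list of (object, from_slot, to_slot) moves.
    """
    current = dict(current)  # work on a copy
    moves = []
    pending = [o for o in goal if current[o] != goal[o]]
    while pending:
        occupied = set(current.values())
        ready = [o for o in pending if goal[o] not in occupied]
        if ready:
            # Some goal slot is free: place that object directly.
            obj, target = ready[0], goal[ready[0]]
        else:
            # Every pending goal slot is taken by another pending object,
            # i.e. a circular dependency. Break it via the buffer.
            obj, target = pending[0], buffer_slot
        moves.append((obj, current[obj], target))
        current[obj] = target
        pending = [o for o in goal if current[o] != goal[o]]
    return moves


def demo_swap():
    """Objects a and b must swap slots; c is already in place."""
    current = {"a": "s1", "b": "s2", "c": "s3"}
    goal = {"a": "s2", "b": "s1", "c": "s3"}
    return plan_moves(current, goal)
```

In `demo_swap`, neither `a` nor `b` can be placed first because each occupies the other's goal slot. Moving `a` to the buffer frees `s1`, after which `b` and then `a` can be placed directly, for three moves in total.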

Theorems & Definitions (7)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Lemma 1
  • proof
  • proof