CORN: Contact-based Object Representation for Nonprehensile Manipulation of General Unseen Objects

Yoonyoung Cho, Junhyek Han, Yoontae Cho, Beomjoon Kim

TL;DR

This work tackles nonprehensile manipulation across diverse unseen objects by introducing CORN, a contact-informed object representation learned via a collision-prediction pretraining task over a patch-based point-cloud encoder. A teacher policy, trained with privileged information, guides a student policy through distillation to operate with partial real-world observations, enabling zero-shot transfer from simulation. The approach combines a patch-transformer backbone with a collision-aware pretraining objective, achieving data- and time-efficient learning and enabling scalable parallel RL across thousands of environments. Results show state-of-the-art performance in simulation and robust sim-to-real transfer to unseen objects, highlighting CORN's potential for versatile real-world nonprehensile manipulation.

Abstract

Nonprehensile manipulation is essential for manipulating objects that are too thin, large, or otherwise ungraspable in the wild. To sidestep the difficulty of contact modeling in conventional modeling-based approaches, reinforcement learning (RL) has recently emerged as a promising alternative. However, previous RL approaches either lack the ability to generalize over diverse object shapes, or use simple action primitives that limit the diversity of robot motions. Furthermore, using RL over diverse object geometry is challenging due to the high cost of training a policy that takes in high-dimensional sensory inputs. We propose a novel contact-based object representation and pretraining pipeline to tackle this. To enable massively parallel training, we leverage a lightweight patch-based transformer architecture for our encoder that processes point clouds, thus scaling our training across thousands of environments. Compared to learning from scratch, or other shape representation baselines, our representation facilitates both time- and data-efficient learning. We validate the efficacy of our overall system by zero-shot transferring the trained policy to novel real-world objects. Code and videos are available at https://sites.google.com/view/contact-non-prehensile.
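As a rough illustration of what "patch-based" processing of a point cloud can look like, the sketch below splits a cloud into local patches that a transformer could then consume as tokens. The abstract does not say how the patches are formed. Farthest point sampling followed by k-nearest-neighbor grouping is an assumption borrowed from common point-cloud transformer designs. The function names and the sizes (512 points, 16 patches, 32 points per patch) are hypothetical and chosen only for the example.

```python
import numpy as np


def farthest_point_sampling(points, num_centers):
    """Greedily pick indices of `num_centers` points that are far apart."""
    n = points.shape[0]
    chosen = np.zeros(num_centers, dtype=np.int64)
    # Squared distance from each point to its nearest chosen center so far.
    min_dist = np.full(n, np.inf)
    chosen[0] = 0  # deterministic start (assumption: any start index works)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen


def group_into_patches(points, num_patches, patch_size):
    """Return (patches, centers): patches are center-relative point groups."""
    center_idx = farthest_point_sampling(points, num_patches)
    centers = points[center_idx]                          # (P, 3)
    # Pairwise squared distances between centers and all points: (P, N).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    knn_idx = np.argsort(d2, axis=1)[:, :patch_size]      # (P, K)
    # Express each patch relative to its center so the token encodes local shape.
    patches = points[knn_idx] - centers[:, None, :]       # (P, K, 3)
    return patches, centers


rng = np.random.default_rng(0)
cloud = rng.normal(size=(512, 3))                         # stand-in object point cloud
patches, centers = group_into_patches(cloud, num_patches=16, patch_size=32)
tokens = patches.reshape(patches.shape[0], -1)            # one flat vector per patch
print(patches.shape, centers.shape, tokens.shape)
```

Under these assumptions, the encoder attends over 16 patch tokens rather than 512 individual points. A short token sequence of this kind is one plausible reason a patch-based encoder stays lightweight enough to run across thousands of parallel environments.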

Paper Structure

This paper contains 20 sections, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Snapshots of our robot performing diverse real-world object manipulation tasks. The first column shows the initial and goal object pose (transparent) of the task, and the green region marks the robot's workspace. Row 1: Raising a cup bottom-side up. The cup's grasp and placement poses are occluded by the collision with the table; since the cup is dynamically unstable, the robot must also support it to keep it from toppling out of the workspace during manipulation. Row 2: Flipping the wipe dispenser upside down, which is too wide to grasp. Since it may roll erratically, frequent re-contacts and dense closed-loop motions are required to stabilize the object during manipulation. Row 3: Moving a book that is too thin to be grasped; to drag the book, the robot must finely adjust the pressure to allow for both reorientation and translation. Row 4: Flipping a toy crab. Given its concave geometry, the robot must utilize the interior contacts to pivot the crab.
  • Figure 2: Even among similar-looking states, the interaction outcome varies drastically depending on the presence of contact. (a-left) The gripper passes near the cup without making contact. As the robot narrowly misses the object, the object remains still. (a-right) The gripper engages with the cup, leading to a successful topple. (b-left) The robot relocates to the left of the block to push it to the goal (dark). By avoiding unintended collision, it is well-positioned to push the object. (b-right) Due to spurious contact, the gripper accidentally topples the block, making the goal harder to reach.
  • Figure 3: Our real-world (left) and simulated (right) domains.
  • Figure 4: Our system and model architecture. The contact network consists of a point cloud encoder (red) and contact-prediction decoder (green), passing the point cloud embeddings to the teacher policy module (blue). The student module (orange, omitted) is detailed in the student-model section.
  • Figure 5: The set of 16 objects that we test in the real world.
  • ...and 5 more figures