M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

TL;DR

M2T2 introduces a unified multi-task masked transformer that learns object-centric 6-DoF grasping and orientation-aware placing directly from scene point clouds. The model leverages a scene encoder, a masked contact decoder, and per-task losses to generate diverse, collision-free poses, with language-conditioned extensions for RLBench tasks. Trained on a large synthetic dataset, M2T2 demonstrates zero-shot sim2real transfer and outperforms task-specific baselines in real robot experiments and RLBench benchmarks, including challenging object re-orientation placements. The work argues for a modular, language-augmented open-world manipulation system by unifying multiple action primitives under a single architecture.

Abstract

With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models can interpret complex tasks from language commands, but they often have difficulty generalizing to out-of-distribution objects because of the limitations of their low-level action primitives. In contrast, existing task-specific models excel at low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model that reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on the real robot, outperforming the baseline system with state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available on our project website: https://m2-t2.github.io.

Paper Structure (33 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: We propose M2T2, a unified model for learning multiple action primitives. M2T2 takes a raw 3D point cloud and predicts 6-DoF grasps per-object (lower left) and orientation-aware placements (lower right, where green means the object can fit in any orientation and yellow means only a subset of orientations are possible). Colors on the point clouds are for visualization only.
  • Figure 2: M2T2 generates valid gripper poses for grasping and placing with a single model. First, a 3D network (scene encoder) takes the scene point cloud and produces multi-scale feature maps. Then, the features are cross-attended with learnable query tokens via a transformer (contact decoder). Finally, the output tokens are multiplied with per-point features to generate contact masks and gripper poses for each object (for grasping) and each orientation (for placing). For grasping, additional MLPs are applied to the output tokens and per-point features to predict objectness scores (to filter out non-object proposals) and grasp parameters (to reconstruct gripper poses). Optionally, the contact decoder can take a set of tokens encoding language goals to produce goal-conditioned grasping and placing poses.
  • Figure 3: M2T2 outperforms the task-specific models Contact-GraspNet [sundermeyer2021contact] for grasping and CabiNet [murali2023cabinet] for placing, on objects from seen categories (a, c) and unseen categories (b, d).
  • Figure 4: Ablation studies
  • Figure 5: Our robot experimental setup. Left: Scenes where the target object (highlighted in red) needs to be reoriented to be placed in the placement region (shown in green). Right: An example of a scene where objects are sequentially moved from the right to the initially empty region on the left.
  • ...and 3 more figures
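
The Figure 2 caption says the decoder's output tokens are multiplied with per-point features to produce contact masks, and that objectness scores filter out non-object proposals. The following is a minimal sketch of that step, not the authors' implementation. It assumes the multiplication is a dot product between each token and each point feature, followed by a sigmoid and a 0.5 threshold; the function names `contact_masks` and `filter_by_objectness`, the thresholds, and the toy values are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def contact_masks(tokens, point_features, threshold=0.5):
    """Compute one binary mask over the points for each output token.

    tokens:         Q lists of length D (decoder output tokens)
    point_features: N lists of length D (per-point features)
    Returns Q lists of N booleans. The dot product, sigmoid, and
    threshold are assumptions made for illustration.
    """
    masks = []
    for token in tokens:
        row = []
        for feat in point_features:
            logit = sum(t * f for t, f in zip(token, feat))
            row.append(sigmoid(logit) > threshold)
        masks.append(row)
    return masks


def filter_by_objectness(masks, objectness, min_score=0.5):
    """Keep only masks whose (assumed) objectness score exceeds min_score."""
    return [m for m, s in zip(masks, objectness) if s > min_score]


# Toy example: 2 tokens, 3 points, feature dimension 2 (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0]]
point_features = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]]
masks = contact_masks(tokens, point_features)
kept = filter_by_objectness(masks, objectness=[0.9, 0.1])
print(masks)
print(kept)
```

In this toy setup each token selects the points whose features align with it, and the second proposal is dropped because its assumed objectness score is low. A real model would operate on batched tensors and learned features rather than nested lists.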