GET-USE: Learning Generalized Tool Usage for Bimanual Mobile Manipulation via Simulated Embodiment Extensions

Bohan Wu, Paul de La Sayette, Li Fei-Fei, Roberto Martín-Martín

TL;DR

GeT-USE addresses the problem of generalized tool usage for bimanual mobile manipulators by learning embodiment extensions in simulation to identify effective tool geometries, then distilling this knowledge into vision-based modules for real-world use. It introduces a two-step process: first, train a tool-building policy π_gtb that extends the robot's end-effectors; second, train a generalized tool selector D_gts, a grasping policy π_gtg, and a manipulation policy π_gtm that perform tool usage from depth images, enabling zero-shot sim-to-real transfer. The approach outperforms state-of-the-art crowd-sourced and procedurally generated tool baselines by 30-60% in success rate on three tasks (Sweeping, Hook_and_grasp, Decanting) on a 22-DOF TIAGo robot with 6-DOF end-effector control. The resulting system can choose and use varied tools in unstructured environments, reducing reliance on handcrafted tools and curated datasets.

Abstract

The ability to use random objects as tools in a generalizable manner is a missing piece in robots' intelligence today, one that would boost their versatility and problem-solving capabilities. State-of-the-art robotic tool usage methods focus on procedurally generating or crowd-sourcing datasets of tools for a task to learn how to grasp and manipulate them for that task. However, these methods assume that only one object is provided and that it is possible, with the correct grasp, to perform the task; they are not capable of identifying, grasping, and using the best object for a task when many are available, especially when the optimal tool is absent. In this work, we propose GeT-USE, a two-step procedure that learns to perform real-robot generalized tool usage by first learning to extend the robot's embodiment in simulation and then transferring the learned strategies to real-robot visuomotor policies. Our key insight is that by exploring a robot's embodiment extensions (i.e., building new end-effectors) in simulation, the robot can identify the general tool geometries most beneficial for a task. This learned geometric knowledge can then be distilled to perform generalized tool usage tasks by selecting and using the best available real-world object as a tool. On a real robot with 22 degrees of freedom (DOFs), GeT-USE outperforms state-of-the-art methods by 30-60% in success rate across three vision-based bimanual mobile manipulation tool-usage tasks.

Paper Structure

This paper contains 11 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: GeT-USE: Generalized Tool-Usage via Simulated Embodiment Extensions. Left: a TIAGo robot achieves a bimanual mobile manipulation task that requires using other objects as tools: "Sweep the Carrot". In simulation, GeT-USE explores different embodiment extensions by looking at (indicated by the "camera" icon) and building on top of its two wrists ("L", "R") until it finds a suitable one. GeT-USE then transfers the successful strategy to the real world by learning vision-based modules to: 1) select the best available object (top right), 2) grasp it (bottom right), and 3) use it (bottom right), all based on real depth images. This methodology allows the robot to learn, in simulation, generalized tool usage tasks that require bimanual mobile manipulation, and to transfer them zero-shot to the real world.
  • Figure 2: The GeT-USE Framework. We explain training (top) and real-world deployment (bottom) of a generalized tool usage solution with GeT-USE. Training with GeT-USE (top) is a two-step procedure. In the first step, the agent is asked to solve a simulated version of the task (top, leftmost) and trains a generalized tool-building policy (top, second-left), $\pi_\mathit{gtb}$, that explores by extending its own embodiment, appending elements (blocks) until it finds an extension that can be used to perform the task. In the second step, the knowledge of the tool-building policy is transferred to visual modules that can be used in the real world: i) a generalized tool selector (top, third-left), $\mathcal{D}_\mathit{gts}$, trained to predict the best grasping area (first generated blocks) of the generalized tools using successful (green tick) and failed (red cross) generated tools; ii) a visuomotor generalized tool grasping policy (top, second-right), $\pi_\mathit{gtg}$, trained to grasp the tools created by $\pi_\mathit{gtb}$ using depth images as input, whose success is detected automatically by a success detector, $\mathcal{D}_\mathit{gs}$; and iii) a visuomotor generalized tool manipulation policy (top, second-right), $\pi_\mathit{gtm}$, that learns to output bimanual mobile manipulation commands to control the robot and achieve the task with the tool generated by $\pi_\mathit{gtb}$. These modules are directly transferred and applied in the real world (bottom): GeT-USE first captures a depth image of the table-top objects and generates object proposals. Each proposal is fed into GeT-USE's tool selector, $\mathcal{D}_\mathit{gts}$, which selects the best object to use as a generalized tool. GeT-USE then uses the tool-grasping and manipulation policies, $\pi_\mathit{gtg}$ and $\pi_\mathit{gtm}$, to grasp and manipulate the selected tool to achieve the task. This process allows GeT-USE to solve tasks requiring generalized tool usage by leveraging simulation and transferring what it learns into visual modules for real-world execution.
  • Figure 3: Example rollouts of GeT-USE's tool-building policy for $\texttt{Sweeping}$, $\texttt{Hook\_and\_grasp}$, and $\texttt{Decanting}$ tasks. The bottom of each block details the tool geometry at timestep $t$ during the tool-building policy rollout. In Sweeping, GeT-USE builds one tool for each wrist, with each timestep separated by a dashed line. The left (L) wrist is marked gray; the right (R) wrist is marked black. In Hook_and_grasp and Decanting, the robot builds only one tool on either arm. In this way, GeT-USE incrementally builds complex tools from small blocks, one timestep at a time.
  • Figure 4: Example object preferences output by GeT-USE's generalized tool selector module. GeT-USE's generalized tool selector ranks each object based on its suitability to serve as a sweeper in Sweeping (left), a hook in Hook_and_grasp (middle), and a decanter in Decanting (right). The green "1", "2", and "3" in each block represent the tool selector's preference, with "1" being the highest ranked. In this way, GeT-USE "makes the best of what it has", even when the ideal tool does not exist on the table.
  • Figure 5: Simulated and Real-World Version of the Tasks. Left to right: the three tasks in our experiments, $\texttt{Sweeping}$, $\texttt{Hook\_and\_grasp}$, and $\texttt{Decanting}$. Top: example tool(s) generated by GeT-USE's trained policy next to a simulation version of the tasks used for embodiment exploration and policy training of bimanual mobile manipulation tasks. Bottom: example of the real-world version of the tasks. By learning from diverse tools and environments in simulation, GeT-USE successfully chooses the best tool in the real world, whether the ideal tool is present or not, and generalizes its policies to real-world objects, environments, and lighting conditions.
  • ...and 1 more figure
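
The real-world deployment described in the Figure 2 caption (score each object proposal with the tool selector, pick the best one, then grasp and manipulate it) can be sketched roughly as below. This is only an illustrative outline, not the authors' implementation: the `Proposal` type, the function names, the nested-list depth format, and the aspect-ratio `toy_selector` are all hypothetical stand-ins for the learned modules $\mathcal{D}_\mathit{gts}$, $\pi_\mathit{gtg}$, and $\pi_\mathit{gtm}$.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical depth-crop format: rows of depth values for one object proposal.
DepthCrop = Sequence[Sequence[float]]


@dataclass(frozen=True)
class Proposal:
    """Hypothetical object proposal: a label plus its cropped depth patch."""
    name: str
    depth_crop: DepthCrop


def rank_proposals(
    proposals: Sequence[Proposal],
    selector: Callable[[DepthCrop], float],
) -> List[Tuple[Proposal, float]]:
    """Score every proposal with the selector and sort best-first."""
    scored = [(p, float(selector(p.depth_crop))) for p in proposals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def deploy(
    proposals: Sequence[Proposal],
    selector: Callable[[DepthCrop], float],
    grasp: Callable[[Proposal], bool],
    manipulate: Callable[[Proposal], bool],
) -> Tuple[str, bool]:
    """Select the top-ranked object, grasp it, then use it for the task.

    `grasp` and `manipulate` stand in for the grasping and manipulation
    policies; each returns True on success. Returns (chosen name, success).
    """
    ranked = rank_proposals(proposals, selector)
    if not ranked:
        raise ValueError("no object proposals to choose from")
    best, _score = ranked[0]
    if not grasp(best):
        return best.name, False  # grasp failed, so the task cannot proceed
    return best.name, bool(manipulate(best))


def toy_selector(crop: DepthCrop) -> float:
    """Placeholder score (crop width / height); NOT the learned selector."""
    return len(crop[0]) / len(crop)
```

Because the ranking is relative, the sketch always returns the best available candidate rather than requiring an ideal tool to be present, which mirrors the "makes the best of what it has" behavior described in the Figure 4 caption. In a real system the callables would wrap trained networks operating on depth images.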