
Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity

Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, Vikas Sindhwani

Abstract

We present an embodied AI system that receives open-ended natural language instructions from a human and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state-of-the-art Large Language Models for task planning, Vision-Language models for semantic perception, and Point Cloud transformers for grasping. With semantic and physical safety in mind, these modules are interfaced with a real-time trajectory optimizer and a compliant tracking controller to enable human-robot proximity. We demonstrate performance on three tasks: bi-arm sorting, bottle opening, and trash disposal. These are performed zero-shot: the models used have not been trained with any real-world data from this bi-arm robot, its scenes, or its workspace. Composing both learning- and non-learning-based components in a modular fashion, with interpretable inputs and outputs, allows the user to easily debug points of failure and fragility. One may also swap modules in place to improve the robustness of the overall platform, for instance with imitation-learned policies. Please see https://sites.google.com/corp/view/safe-robots.

Paper Structure (16 sections, 2 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Left: Sorting Task; the bi-arm robot follows open-ended long-horizon natural language instructions like "Move the metal objects to the left side.". Middle: Bottle Opening, involving a twisting primitive with one arm, and a stable hold of the bottle with the other, where variable compliance is used for the holding and twisting arms. Right: Trash disposal, where the right arm stiffly presses on the foot pedal while the left arm places objects into the bin.
  • Figure 2: A Modular Bi-arm Embodied AI System. A list of the active scene objects, together with the user's instruction, is passed to an LLM Task Planner. Leveraging in-context learning on example API usage, the planner generates a sequence of high-level executable commands for the robot. Each high-level command, e.g., pick-and-place, leverages a state machine that orchestrates VLM-Point Cloud (VLM-PC) perception-based helper functions and a collection of bi-arm Skills such as pick, handover, and place. Finally, within each Skill, filtered point clouds from the Perception module are combined with Point Cloud Transformer (PCT)-based grasping policies and SQP-based motion planning to generate joint-space trajectories for both arms, which are tracked using a compliant controller.
  • Figure 3: Example execution of a successfully generated bi-arm sorting LLM plan to "move the metallic objects to the left side". The robot switches to picking with the right arm after it fails to reach the can with its left arm, and then performs a handover to the left arm for placing.
  • Figure 4: Bottle opening task conducted by our manipulation robot. Upper row: OWL-ViT's part-specific detection of the bottle and cap. Lower row: Execution of the state-machine. Aside from OWL-ViT, no other learning was involved.
  • Figure 5: OWL-ViT lid localizations for held-out bottles are highlighted with green bounding boxes. A red border indicates bottles the robot failed to open, while a blue border indicates bottles it opened successfully.
  • ...and 1 more figure
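
The arm-switching behaviour described in the Figure 3 caption (pick with the other arm when the placing arm cannot reach the object, then hand over) can be illustrated with a minimal Python sketch. This is not the authors' state machine: the function names `other_arm` and `plan_pick_and_place`, the `can_pick` reachability predicate, and the tuple encoding of skill steps are all hypothetical, chosen only to show how a pick-and-place command might expand into pick, handover, and place skills.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, ...]  # hypothetical encoding: (skill, arm[, receiving arm], object)


def other_arm(arm: str) -> str:
    """Return the opposite arm of a two-arm robot."""
    return "right" if arm == "left" else "left"


def plan_pick_and_place(
    obj: str,
    place_arm: str,
    can_pick: Callable[[str, str], bool],
) -> List[Step]:
    """Expand one pick-and-place command into a list of skill steps.

    `can_pick(arm, obj)` is an assumed stand-in for whatever reachability
    or grasp-feasibility check the real system performs.
    """
    steps: List[Step] = []
    if can_pick(place_arm, obj):
        # The placing arm can reach the object directly.
        steps.append(("pick", place_arm, obj))
    else:
        # Fall back to the other arm, then hand the object over.
        helper = other_arm(place_arm)
        if not can_pick(helper, obj):
            raise RuntimeError(f"neither arm can pick {obj!r}")
        steps.append(("pick", helper, obj))
        steps.append(("handover", helper, place_arm, obj))
    steps.append(("place", place_arm, obj))
    return steps


if __name__ == "__main__":
    # Toy scenario: only the right arm can reach the can, but it must be
    # placed on the left side, so a handover is inserted.
    for step in plan_pick_and_place("can", "left", lambda arm, obj: arm == "right"):
        print(step)
```

In this sketch the fallback is decided by a predicate evaluated up front; in the figure, the robot switches arms after an actual failed reach, so a closer model would react to a failure reported by the pick skill itself.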