Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real

Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, Vikash Kumar

TL;DR

This paper tackles multi-agent manipulation via locomotion by introducing hierarchical sim2real, where a low-level locomotion policy is learned in simulation and a high-level controller learns task directives that steer the low-level policy. Training is performed in two phases and leverages domain randomization at each level to achieve zero-shot transfer to real-world robots. The method is validated on three real-world quadrupedal tasks—Avoid, Push, and Coordinate—showing that hierarchy plus targeted randomization yields robust real-world performance, including a successful demonstration of coordinated multi-agent manipulation. The work highlights modularity in sim2real and suggests that hierarchical structures simplify bridging the sim-to-real gap for complex, interactive robotics tasks.

Abstract

Manipulation and locomotion are closely related problems that are often studied in isolation. In this work, we study the problem of coordinating multiple mobile agents to exhibit manipulation behaviors using a reinforcement learning (RL) approach. Our method hinges on the use of hierarchical sim2real -- a simulated environment is used to learn low-level goal-reaching skills, which are then used as the action space for a high-level RL controller, also trained in simulation. The full hierarchical policy is then transferred to the real world in a zero-shot fashion. The application of domain randomization during training enables the learned behaviors to generalize to real-world settings, while the use of hierarchy provides a modular paradigm for learning and transferring increasingly complex behaviors. We evaluate our method on a number of real-world tasks, including coordinated object manipulation in a multi-agent setting. See videos at https://sites.google.com/view/manipulation-via-locomotion

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Multiple agents coordinate to move a large object.
  • Figure 2: We consider three quadrupedal locomotion tasks of increasing complexity, utilizing the D'Kitty robot (see the hardware section for details on this robot). From left to right, we present the simulated (top row, using MuJoCo) and real-world (bottom row) versions of the three tasks: Avoid, in which the quadruped must walk to a target location while avoiding a block object; Push, in which a quadruped must push a block object to a desired location; and Coordinate, in which two quadrupeds coordinate to push a long block to a target location and orientation. We utilize HTC Vive controllers and trackers to track the real-world position and orientation of agents, objects, and (for Avoid and Push) the desired target locations.
  • Figure 3: We propose to solve tasks using a hierarchical policy in which a high-level policy $\pi_{\mathrm{hi}}$ produces high-level actions $a_{\mathrm{hi}}$, which are transformed into goals $g$ that a lower-level policy $\pi_{\mathrm{lo}}$ is trained to reach. In quadrupedal locomotion tasks, this is a natural decomposition: the low-level policy may be trained to produce behaviors that reach various goals; i.e., $g:=(g_x,g_y)$ is a desired point and the state representation $f(s)$ is simply the $x,y$ coordinates of the quadruped. The high-level policy then solves a task by iteratively directing the low-level policy to a sequence of goal locations.
  • Figure 4: The low-level policy is trained to perform simple goal-reaching in simulation (left). We apply a variety of domain randomizations, including randomized height fields, which we found to be most helpful (middle). These randomizations lead to a robust locomotion policy in a variety of real-world environments, including both indoor and outdoor terrain (right).
  • Figure 5: The high-level policy directs the agent by setting relative position goals. In this way, it combines locomotive primitives to solve a more complex task, as in this example trajectory for the Avoid task. To account for potential unknown gaps between low-level behavior in simulation and reality (blue vs. orange arrows), during training we pollute the high-level actions with random noise.
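
The control loop described in the captions for Figures 3 and 5 can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the names `pi_hi`, `pi_lo`, and `f` follow the symbols in Figure 3, but both policies here are hand-written stand-ins for learned networks, the robot is reduced to a point in the plane, and the Gaussian noise model, step sizes, and horizon lengths are all assumptions chosen for readability.

```python
import math
import random


def f(state):
    """State representation f(s): the agent's (x, y) position."""
    return state[0], state[1]


def clip_to_length(dx, dy, max_len):
    """Shrink the vector (dx, dy) so its length is at most max_len."""
    dist = math.hypot(dx, dy)
    if dist <= max_len:
        return dx, dy
    scale = max_len / dist
    return dx * scale, dy * scale


def pi_hi(state, target, max_offset=0.5):
    """Stand-in high-level policy: a relative goal pointing at the target.

    A learned policy would replace this; max_offset is a hypothetical bound.
    """
    x, y = f(state)
    return clip_to_length(target[0] - x, target[1] - y, max_offset)


def to_goal(state, a_hi, noise_std, rng):
    """Turn a relative high-level action into an absolute goal g = (g_x, g_y).

    The Gaussian perturbation is an assumed noise model standing in for the
    random noise applied to high-level actions during training.
    """
    x, y = f(state)
    return (x + a_hi[0] + rng.gauss(0.0, noise_std),
            y + a_hi[1] + rng.gauss(0.0, noise_std))


def pi_lo(state, goal, max_step=0.05):
    """Stand-in low-level policy: move straight toward the goal.

    For simplicity it returns the next position directly instead of joint
    commands; max_step is a hypothetical per-step travel limit.
    """
    x, y = f(state)
    dx, dy = clip_to_length(goal[0] - x, goal[1] - y, max_step)
    return x + dx, y + dy


def rollout(start, target, hi_steps=20, lo_steps=10, noise_std=0.0, seed=0):
    """Run the two-level loop: one goal per high-level step, many low-level steps."""
    rng = random.Random(seed)
    state = start
    for _ in range(hi_steps):
        a_hi = pi_hi(state, target)
        goal = to_goal(state, a_hi, noise_std, rng)
        for _ in range(lo_steps):
            state = pi_lo(state, goal)
    return state


if __name__ == "__main__":
    print(rollout((0.0, 0.0), (2.0, 0.0), noise_std=0.05))
```

Setting `noise_std` above zero mimics the idea from Figure 5: the high-level policy sees goals that are not reached exactly, so it has to keep re-issuing corrections instead of relying on perfect low-level execution.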