Rapid Motor Adaptation for Robotic Manipulator Arms

Yichao Liang, Kevin Ellis, João Henriques

TL;DR

The paper presents Rapid Motor Adaptation for Robotic Manipulator Arms (RMA$^2$), extending RMA to dexterous manipulation by learning geometry-aware priors via category-instance dictionaries and a depth-based adapter that infers environment embeddings from history and depth imagery. The method uses a two-phase training scheme: (i) a policy conditioned on privileged environment parameters learned with PPO, and (ii) an adapter that predicts the embedding from past observations and depth data, enabling deployment without privileged information. Across four ManiSkill2 tasks (Pick & Place on YCB/EGAD, Peg Insertion, Faucet Turning), RMA$^2$ consistently outperforms domain-randomization baselines and ablations, while approaching an Oracle upper bound. The work demonstrates improved generalization and sample efficiency, highlighting the value of geometry-aware embeddings and depth-informed adaptation for robust, real-world manipulation.

Abstract

Developing generalizable manipulation skills is a core challenge in embodied AI. This includes generalization across diverse task configurations, encompassing variations in object shape, density, friction coefficient, and external disturbances such as forces applied to the robot. Rapid Motor Adaptation (RMA) offers a promising solution to this challenge. It posits that essential hidden variables influencing an agent's task performance, such as object mass and shape, can be effectively inferred from the agent's action and proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand rotation, we use depth perception to develop agents tailored for rapid motor adaptation in a variety of manipulation tasks. We evaluated our agents on four challenging tasks from the ManiSkill2 benchmark: pick-and-place operations with hundreds of objects from the YCB and EGAD datasets, peg insertion with precise position and orientation, and operating a variety of faucets and handles, all with customized environment variations. Empirical results demonstrate that our agents surpass state-of-the-art methods such as automatic domain randomization and vision-based policies, obtaining better generalization performance and sample efficiency.

Paper Structure (25 sections, 4 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Visualization of an action trajectory by RMA$^2$ in each of the four tasks. The top two trajectories also show the corresponding low-resolution depth images as seen by the adapter module. We highlight a few interesting behaviors. In the first trajectory, from the Pick & Place task (YCB dataset), the agent first attempts to pick up a cup by the rim. This fails because the rim, in this instance of randomization, is too wide for the gripper. The agent then reattempts by grasping the cup by the handle, which succeeds. In the second trajectory, from the Faucet Turning task, the agent does not grasp the handle but pushes it with one finger to rotate it. The depth image shows the precise positioning of the end effector. In the third trajectory, from the Peg Insertion task, the agent does not aim correctly for insertion on the first attempt, because of the external disturbances applied to the peg and the millimeter-level clearance of the hole. It succeeds after "jiggling" the peg around the correct position, a strategy that mimics human behavior. In the fourth trajectory, from the Pick & Place task (EGAD dataset), the agent attempts to pick up a previously unseen EGAD object. The object is too wide to grasp from the top, as it lies flat on the floor (a zoomed-in inset is shown). The agent picks it up by pressing the left side of the object with its left finger and inserting its right finger beneath the object, a reasonable strategy for picking up a flat object.
  • Figure 2: Overview of the proposed training procedure, which consists of two phases. In the first phase, a conditional policy $\pi$ is trained to maximize a reward (e.g. move an object to a given position or orientation), given observations $x_t$ (e.g. joint angles), a goal description $g$ and privileged information about the environment $e$, $s_t$. The environment is randomized (e.g. varying mass or object identities), so an environment encoder $\mu$ is trained jointly to distill this privileged information into an embedding $z_t$. In the second phase, the policy $\pi$ and encoder $\mu$ are frozen, and an adapter $\phi$ and CNN $\psi$ are trained with an $L^2$ loss to predict the privileged information in $z_t$ from just a history of observations (e.g. past dynamic behaviour) and a depth image $d_t$ (e.g. object appearance). The adapter, CNN and policy can then be deployed to perform adaptive manipulation directly from observations and depth images.
  • Figure 3: (a) Example objects from the EGAD dataset, sorted by grasp and shape complexity, illustrating the diversity of shapes. The horizontal axis indicates ascending shape complexity, while the vertical axis corresponds to increasing grasp complexity. (b) Fine-grained evaluation of the performance of RMA$^2$ (left) and DR+Vi (right) on ManiSkill2's Pick & Place task, with EGAD objects. The color coding reflects the success rate (bright yellow for 100%, dark blue for 0%), averaged over 500 runs. The white cells correspond to objects that are not in the dataset for this task.
  • Figure 4: (a) The YCB objects we use in the Pick & Place task. (b), (c) Sample faucets for the Faucet Turning task.
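
The second training phase described in the Figure 2 caption (frozen encoder $\mu$, adapter $\phi$ and CNN $\psi$ regressed onto $z_t$ with an $L^2$ loss) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: all dimensions, the synthetic data generator `sample_batch`, the linear adapter, and the use of fixed 2x2 average pooling in place of a trained CNN are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
E_DIM, Z_DIM = 4, 3           # privileged parameters e, embedding z
HIST_DIM, DEPTH_HW = 16, 8    # flattened observation history, depth image side

# Frozen phase-1 encoder mu: privileged parameters -> embedding z.
W_mu = rng.normal(size=(E_DIM, Z_DIM)) / np.sqrt(E_DIM)

def mu(e):
    return np.tanh(e @ W_mu)

# Toy stand-in for the simulator: history and depth depend on e plus noise.
A_hist = rng.normal(size=(E_DIM, HIST_DIM)) / np.sqrt(E_DIM)
A_depth = rng.normal(size=(E_DIM, DEPTH_HW * DEPTH_HW)) / np.sqrt(E_DIM)

def sample_batch(n):
    e = rng.normal(size=(n, E_DIM))
    history = e @ A_hist + 0.1 * rng.normal(size=(n, HIST_DIM))
    depth = e @ A_depth + 0.1 * rng.normal(size=(n, DEPTH_HW * DEPTH_HW))
    return e, history, depth.reshape(n, DEPTH_HW, DEPTH_HW)

def psi(depth):
    """Stand-in for the CNN: fixed 2x2 average pooling, then flatten."""
    n = depth.shape[0]
    half = DEPTH_HW // 2
    pooled = depth.reshape(n, half, 2, half, 2).mean(axis=(2, 4))
    return pooled.reshape(n, -1)

def adapt(W_phi, history, depth):
    """Adapter phi: predicts the embedding without privileged information."""
    feats = np.concatenate([history, psi(depth)], axis=1)
    return feats @ W_phi

def train_adapter(steps=500, batch=64, lr=0.01):
    feat_dim = HIST_DIM + (DEPTH_HW // 2) ** 2
    W_phi = np.zeros((feat_dim, Z_DIM))
    losses = []
    for _ in range(steps):
        e, history, depth = sample_batch(batch)
        z_target = mu(e)                       # frozen target, no update to mu
        feats = np.concatenate([history, psi(depth)], axis=1)
        err = feats @ W_phi - z_target
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))  # L2 loss
        W_phi -= lr * (2.0 * feats.T @ err / batch)               # gradient step
    return W_phi, losses

if __name__ == "__main__":
    W_phi, losses = train_adapter()
    print(f"L2 loss: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

At deployment, only `adapt` would be called: the predicted embedding replaces the encoder output as the policy's conditioning input, so the privileged parameters are no longer needed.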