Learning Time-Optimal and Speed-Adjustable Tactile In-Hand Manipulation

Johannes Pitz, Lennart Röstel, Leon Sievers, Berthold Bäuml

TL;DR

This paper addresses the critical performance measure of the speed at which an in-hand manipulation can be performed, and presents reinforcement learning policies that can perform in-hand reorientation significantly faster than previous approaches for the complex setting of goal-conditioned reorientation in $\mathrm{SO}(3)$.

Abstract

In-hand manipulation with multi-fingered hands is a challenging problem that recently became feasible with the advent of deep reinforcement learning methods. While most contributions to the task brought improvements in robustness and generalization, this paper addresses the critical performance measure of the speed at which an in-hand manipulation can be performed. We present reinforcement learning policies that can perform in-hand reorientation significantly faster than previous approaches for the complex setting of goal-conditioned reorientation in SO(3) with permanent force closure and tactile feedback only (i.e., using the hand's torque and position sensors). Moreover, we show how policies can be trained to be speed-adjustable, allowing for setting the average orientation speed of the manipulated object during deployment. To this end, we present suitable and minimalistic reinforcement learning objectives for time-optimal and speed-adjustable in-hand manipulation, as well as an analysis based on extensive experiments in simulation. We also demonstrate the zero-shot transfer of the learned policies to the real DLR-Hand II with a wide range of target speeds and the fastest dextrous in-hand manipulation without visual inputs.

Paper Structure

This paper contains 15 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The DLR-Hand II [Butterfass2001] performs the complex task of reorienting a cube to a goal orientation for three different desired orientation speeds spanning a factor of four (more examples are shown in the accompanying video). The time needed for reorienting the cube matches the desired speed. The lower right shows a closeup of the hand and the overall robotic setup with the humanoid Agile Justin. All reorientations are performed using tactile feedback only, i.e., the hand's position and torque sensors (no visual input, hence the blindfolded robot).
  • Figure 2: Overview of control architecture and system components. We use a learned state estimator $\rho$ to estimate the object pose $\hat{s}_t$ from proprioceptive (joints' torques and angles) observations $z_t$. Based on the estimated state, a shape encoding $\mathcal{S}$ is computed and used as input to the control policy $\pi$. The control policy is additionally conditioned on the desired object orientation $R_\text{g}$ and optionally a target speed signal $\xi$, which controls the speed of reorientation. The actions of the policy are low-pass filtered and given to an underlying impedance controller for the torque-controlled DLR-Hand II.
  • Figure 3: (Left) Success rate $b$ and (center) average time $T$ required to reach the first goal, plotted over the training progress. Each line is the mean over three training runs, with shaded areas covering the min and max. We smooth the signal of the individual runs. (Right) Average angular velocity $\omega = \theta_0 / T$ over evaluation episodes. We ran 1200 episodes with a single policy each and discarded failed episodes ($<3\%$) and episodes where $\theta_0 < \pi / 4$ to avoid high variance due to small numbers and reorientations without regrasping.
  • Figure 4: Box plot of the time to reach the goal $T$ of evaluation episodes grouped by the target time $T_\text{d}$ rounded to the nearest integer. We ran 1200 episodes with a single policy each and discarded failed episodes ($<3\%$). Whiskers indicate the 5th and 95th percentiles. (Left) $H_\text{exp} = 2\,\mathrm{s}$. (Right) $H_\text{exp}$ is sampled between $0$ and $1\,\mathrm{s}$.
  • Figure 5: Scatter plot of the effective speed $\omega = \theta_0 / T$ against the target speed $\omega_\text{d}$ of individual evaluation episodes. The target speeds are sampled uniformly between $0.25$ and $2.5\,\mathrm{rad/s}$, the same range as during training. We only plot successful trials (success rate $b = 93.5\%$). The color indicates the initial angle $\theta_0$, showing no clear correlation between the initial angle $\theta_0$ and the difficulty for the policy to match the target speed. However, the variance increases significantly for small initial angles due to reorientations without regrasping.
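
The effective speed $\omega = \theta_0 / T$ used in Figures 3 and 5, together with the episode filtering described in the caption of Figure 3, can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' evaluation code: the `Episode` record, its field names, and the helper functions are hypothetical. Only the formula and the $\pi / 4$ threshold are taken from the captions.

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one evaluation episode (field names are assumptions)."""
    theta0: float    # initial angle to the goal orientation, in rad
    duration: float  # time T needed to reach the goal, in s
    success: bool    # whether the goal orientation was reached


def effective_speed(ep: Episode) -> float:
    """Effective angular speed omega = theta_0 / T, in rad/s."""
    return ep.theta0 / ep.duration


def filter_episodes(episodes, min_angle=math.pi / 4):
    """Drop failed episodes and episodes with theta_0 < min_angle (Figure 3 caption)."""
    return [ep for ep in episodes if ep.success and ep.theta0 >= min_angle]


def mean_effective_speed(episodes, min_angle=math.pi / 4):
    """Average effective speed over the episodes that survive the filtering."""
    kept = filter_episodes(episodes, min_angle)
    if not kept:
        raise ValueError("no episodes left after filtering")
    return sum(effective_speed(ep) for ep in kept) / len(kept)


if __name__ == "__main__":
    # Made-up example values, not results from the paper.
    episodes = [
        Episode(theta0=math.pi / 2, duration=2.0, success=True),
        Episode(theta0=math.pi, duration=4.0, success=True),
        Episode(theta0=0.1, duration=0.1, success=True),       # below pi/4: discarded
        Episode(theta0=math.pi, duration=10.0, success=False),  # failed: discarded
    ]
    print(mean_effective_speed(episodes))
```

The third example episode shows why the angle threshold matters: a small initial angle reached almost immediately gives a ratio that says little about sustained reorientation speed, which is consistent with the increased variance for small $\theta_0$ noted in the caption of Figure 5.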