Estimator-Coupled Reinforcement Learning for Robust Purely Tactile In-Hand Manipulation

Lennart Röstel; Johannes Pitz; Leon Sievers; Berthold Bäuml

TL;DR

This work tackles robust, purely tactile in-hand manipulation by coupling state estimation and reinforcement learning in a single training loop (Estimator-Coupled Reinforcement Learning, EcRL). Training the estimator and the policy concurrently, and conditioning the policy on the estimated state, mitigates failures caused by estimator bias and stochastic contact dynamics. This enables robust reorientation of diverse objects with the DLR-Hand II and successful sim2real transfer. The method learns quickly (median training time of 6.5 hours on a single low-cost GPU) and reorients a cube to a median of nine consecutive goals, which was beyond the reach of prior tactile-only methods. The results highlight the practical value of estimator-aware, concurrent training for dexterous, vision-free manipulation.
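The coupling described above can be illustrated with a deliberately tiny toy problem. The following Python sketch is a hypothetical stand-in, not the authors' implementation: a scalar hidden state replaces the object pose, a one-parameter recurrent filter with gain `L` replaces the learned estimator $f_{\phi}$, a linear policy $u = -K\hat{s}$ replaces $\pi_{\varphi}$, and finite-difference search with common random numbers replaces reinforcement learning. It only shows the structure of the concurrent scheme: in every iteration the estimator is fitted (using the privileged simulator state) on data generated by the current policy, and the policy is then improved while acting on the estimate, so it experiences the estimator's errors during training. All names, noise levels, and step sizes are assumptions.

```python
import random

# Hypothetical toy constants (not from the paper).
T = 50             # steps per rollout
PROC_STD = 0.05    # process noise, a stand-in for stochastic contact dynamics
OBS_STD = 0.3      # observation noise, a stand-in for weak tactile sensing
CTRL_WEIGHT = 0.1  # control penalty in the cost


def rollout(K, L, rng):
    """One episode in which the policy acts on the ESTIMATE s_hat, not on s.

    Returns the average cost and the average truncated gradient of the
    squared estimation error with respect to the estimator gain L.
    """
    s = rng.gauss(0.0, 1.0)  # true state, known only to the simulator
    s_hat = 0.0              # estimator starts uninformed
    u = 0.0
    cost, grad_L = 0.0, 0.0
    for _ in range(T):
        s = s + u + rng.gauss(0.0, PROC_STD)  # simulator advances the true state
        z = s + rng.gauss(0.0, OBS_STD)       # noisy observation
        pred = s_hat + u                      # estimator prediction step
        s_hat = pred + L * (z - pred)         # correction with learnable gain L
        # Supervised estimator loss uses the privileged simulator state;
        # the gradient is truncated (ignores how pred depends on L).
        grad_L += 2.0 * (s_hat - s) * (z - pred)
        u = -K * s_hat                        # policy conditioned on the estimate
        cost += s * s + CTRL_WEIGHT * u * u
    return cost / T, grad_L / T


def train(iterations=200, seed=0):
    rng = random.Random(seed)
    K, L = 0.1, 0.5                           # initial policy / estimator parameters
    lr_L, lr_K, delta = 0.05, 0.05, 0.05
    history = []
    for _ in range(iterations):
        # 1) Estimator update on data generated by the CURRENT policy.
        _, g = rollout(K, L, rng)
        L = min(1.0, max(0.0, L - lr_L * g))
        # 2) Policy update with the CURRENT estimator in the loop
        #    (finite differences with a shared episode seed, standing in for RL).
        ep_seed = rng.randrange(2**32)
        c_plus, _ = rollout(K + delta, L, random.Random(ep_seed))
        c_minus, _ = rollout(K - delta, L, random.Random(ep_seed))
        step = -lr_K * (c_plus - c_minus) / (2.0 * delta)
        K = min(1.5, max(0.0, K + max(-0.1, min(0.1, step))))
        history.append(0.5 * (c_plus + c_minus))
    return K, L, history


K, L, history = train()
print(f"policy gain K={K:.2f}, estimator gain L={L:.2f}")
```

Because the policy only ever sees `s_hat`, a gain `K` that is too aggressive for the current estimation error is penalized directly by the rollout cost; a policy trained on the true state and combined with the estimator only at test time would never see that penalty.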

Abstract

This paper identifies and addresses the problems with naively combining (reinforcement) learning-based controllers and state estimators for robotic in-hand manipulation. Specifically, we tackle the challenging task of purely tactile, goal-conditioned, dextrous in-hand reorientation with the hand pointing downwards. Due to the limited sensing available, many control strategies that are feasible in simulation given full knowledge of the object's state do not allow for accurate state estimation. Hence, separately training the controller and the estimator and combining the two at test time leads to poor performance. We solve this problem by coupling the control policy to the state estimator already during training in simulation. This approach leads to more robust state estimation and overall higher performance on the task while maintaining an interpretability advantage over end-to-end policy learning. With our GPU-accelerated implementation, learning from scratch takes a median training time of only 6.5 hours on a single, low-cost GPU. In simulation experiments with the DLR-Hand II and for four significantly different object shapes, we provide an in-depth analysis of the performance of our approach. We demonstrate the successful sim2real transfer by rotating the four objects to all 24 orientations in the $\pi/2$ discretization of SO(3), which has never been achieved for such a diverse set of shapes. Finally, our method can reorient a cube to a median of nine consecutive goals, which was beyond the reach of previous methods in this challenging setting.
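The 24 goal orientations of the $\pi/2$ discretization of SO(3) are exactly the 3x3 rotation matrices with entries in {-1, 0, 1}, i.e., the signed permutation matrices with determinant +1 (6 permutations times 8 sign patterns, half of which are proper rotations). The following Python sketch enumerates them; the helper names are hypothetical, and the ordering it produces is arbitrary and does not match the goal indices used in the paper's figures.

```python
import itertools


def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cube_rotations():
    """All rotation matrices in the pi/2 discretization of SO(3)."""
    rots = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1, -1), repeat=3):
            # Place one signed unit entry per row, in the permuted column.
            m = [[0] * 3 for _ in range(3)]
            for row, (col, sign) in enumerate(zip(perm, signs)):
                m[row][col] = sign
            if det3(m) == 1:  # keep proper rotations, drop reflections
                rots.append(tuple(tuple(r) for r in m))
    return rots


GOALS = cube_rotations()
print(len(GOALS))  # 24
```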

Paper Structure (25 sections, 4 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Contributions
Related Work
Estimator-Coupled Reinforcement Learning
Motivation
Concurrent Learning Scheme
Estimator Learning
Policy Learning
Application to blind, goal-conditioned in-hand manipulation
Task Description
System Description
Reinforcement Learning Environment
Episode
Observations
Reward
...and 10 more sections

Figures (7)

Figure 1: In-hand manipulation of different object shapes with the DLR-Hand II Butterfass2001. The object set and the benchmark results are depicted in Figure 5. The task consists of deliberately reorienting the objects to an externally specified target orientation. The shown setting is especially challenging because the hand points downwards, which demands permanent force closure. The manipulation is performed blindly, i.e., without cameras, using high-fidelity joint torque sensing for purely tactile tracking of the object pose.
Figure 2: Left: Agile Justin Bauml2014 performing in-hand manipulation while being blindfolded. Right: The goal of the task is to bring the object orientation $R$ to a goal orientation $R_g$. The estimated state $\hat{R}$ is visualized in transparent blue.
Figure 3: Two failure cases produced by rollouts of a non-robust control policy performing purely tactile in-hand reorientation of a cube. a): During reorientation towards a goal, the cube is held by only two fingers, allowing it to tip about the blue axis. Because the occurrence and extent of the tipping depend on unknowns such as friction effects, the dynamics are effectively stochastic, and the estimate of the object rotation becomes ambiguous. This is seen in the bottom plot, which shows the angle to the target orientation $d(R,R_g)$ for multiple trials of the same rotation sequence in gray; the trajectory corresponding to the manipulation sequence shown on top is indicated in blue. Due to the object's symmetry, these ambiguities in the estimate often cannot be resolved, and the task fails due to a permanent loss of observability. b): The controller is tasked to perform rotations around the vertical axis (new goals, each rotated by $\pi$ rad, are set every 5 s). Over the course of this manipulation, the estimated position (dashed line) drifts away from the ground-truth position (solid line) along the $x_3$-axis, as this coordinate can hardly be determined from lateral contact measurements (compare Pitz2023dextrous). The non-robust controller assumes the position estimate to be unbiased, which leads to the cube being dropped.
Figure 4: The system state $s$ is advanced in time by the simulator or the real robot. From the observation $z$, the state estimator $f_{\phi}$ recurrently produces an estimate $\hat{s}$ of the state. Based on the current observation and the estimated state, the policy $\pi_{\varphi}$ computes the control inputs $u$.
Figure 5: Success rates $B$ on the in-hand reorientation benchmark for each of the 24 possible goal rotations (columns) and each of the considered objects (rows). The goal indices and the associated coordinate transformations are indicated on the horizontal axis. Note that index 3 is the identity rotation, for which the controller needs to hold the object for 5 s without rotating it. The four considered object shapes are shown on the right. Each dot represents the success rate over 50 trials in simulation. Black horizontal bars indicate results of EcRL on the real system for the two goal orientations that perform best and worst, respectively, in simulation for each object.
...and 2 more figures
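The captions above plot the angle to the target orientation $d(R,R_g)$ without defining it here. Assuming it is the standard geodesic angle on SO(3), $d(R,R_g) = \arccos\big((\operatorname{tr}(R_g^\top R) - 1)/2\big)$, it can be computed as in the following Python sketch (function names are hypothetical). The argument of `acos` is clamped because round-off can push it slightly outside [-1, 1].

```python
import math


def rotation_angle(R, R_g):
    """Geodesic angle (rad) between 3x3 rotation matrices R and R_g."""
    # trace(R_g^T R) equals the elementwise sum R_g[i][j] * R[i][j].
    tr = sum(R_g[i][j] * R[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against round-off
    return math.acos(c)


def rot_z(theta):
    """Rotation by theta (rad) about the third coordinate axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


I = rot_z(0.0)
print(rotation_angle(rot_z(math.pi), I))  # approximately pi
```

For goal switches like those in Figure 3 b), where each new goal is rotated by $\pi$ rad about one axis, the angle between consecutive goals is $\pi$, the largest value this metric can take.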