Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing

Lukas Mack, Felix Grüninger, Benjamin A. Richardson, Regine Lendway, Katherine J. Kuchenbecker, Joerg Stueckler

TL;DR

This work addresses robust 3D object pose estimation during in-hand grasping under occlusion by fusing vision, proprioception, and binary low-resolution tactile sensing within a factor-graph framework. It estimates the object pose $\boldsymbol{\xi} \in SE(3)$ from a visual pose measurement $\boldsymbol{\zeta}$, binary tactile readings $\mathbf{y}$, and the hand configuration $\mathbf{q}$, and it adds a non-penetration prior and a CAD-based signed distance function (SDF) for the object geometry. The method, implemented as a robust non-linear optimization with the Theseus Levenberg-Marquardt (LM) solver, improves accuracy over vision-only baselines in both simulation and real-world experiments and runs at approximately 13.3 Hz on average. The results indicate that even coarse tactile sensing on the interior surface of the hand can meaningfully improve in-hand pose estimation, which supports more reliable dexterous manipulation and manipulation planning in occluded scenarios.
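As a rough illustration of how binary contacts and a non-penetration prior can constrain a pose through an SDF, the sketch below evaluates both kinds of residuals for a candidate pose. It is not the authors' implementation: the analytic sphere SDF stands in for the CAD-based SDF, and the function names, the rotation-plus-translation pose parameterization, and the sample points are all assumptions made for this example.

```python
import numpy as np

def sphere_sdf(points_obj, radius):
    """Signed distance to a sphere centred at the object origin (stand-in for a CAD-based SDF)."""
    return np.linalg.norm(points_obj, axis=-1) - radius

def to_object_frame(points_world, R, t):
    """Map world points into the object frame: p_obj = R^T (p_world - t)."""
    return (points_world - t) @ R

def tactile_residuals(pad_points, contact, R, t, radius):
    """Gap between each active sensor pad and the object surface.

    Pads that report no contact contribute zero. Penetrating pads are left to
    the penetration term, so only non-negative distances are kept here.
    """
    d = sphere_sdf(to_object_frame(pad_points, R, t), radius)
    return np.where(contact, np.maximum(d, 0.0), 0.0)

def penetration_residuals(link_points, R, t, radius):
    """Penetration depth of hand surface points that lie inside the object."""
    d = sphere_sdf(to_object_frame(link_points, R, t), radius)
    return np.maximum(-d, 0.0)

# Hypothetical example: a 5 cm sphere at the origin with identity orientation.
R, t, radius = np.eye(3), np.zeros(3), 0.05
pads = np.array([[0.07, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.2]])
contact = np.array([True, True, False])
links = np.array([[0.03, 0.0, 0.0], [0.1, 0.0, 0.0]])
r_tac = tactile_residuals(pads, contact, R, t, radius)
r_pen = penetration_residuals(links, R, t, radius)
```

In this toy setup, the first pad reports contact but sits 2 cm off the surface, so it produces a non-zero tactile residual that an optimizer would reduce by moving the object toward the pad. The first link point lies 2 cm inside the sphere and is penalized by the penetration term.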

Abstract

Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function to reduce the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 13.3 Hz on average.
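To make the role of a robust cost function concrete, the following minimal sketch compares a plain least-squares average with a Huber-weighted estimate on position measurements that contain one outlier. The Huber loss, the iteratively reweighted scheme, the threshold `delta`, and the data are assumptions chosen for illustration. The abstract does not state which robust cost the authors use.

```python
import numpy as np

def huber_weights(residual_norms, delta):
    """IRLS weights for the Huber loss: 1 inside delta, delta / |r| outside."""
    r = np.maximum(residual_norms, 1e-12)  # avoid division by zero
    return np.where(r <= delta, 1.0, delta / r)

def robust_mean(measurements, delta=0.01, iters=20):
    """Huber-robust location estimate via iteratively reweighted least squares."""
    est = measurements.mean(axis=0)  # start from the least-squares solution
    for _ in range(iters):
        r = np.linalg.norm(measurements - est, axis=1)
        w = huber_weights(r, delta)
        est = (w[:, None] * measurements).sum(axis=0) / w.sum()
    return est

# Hypothetical data: nine consistent position readings and one gross outlier (metres).
inliers = np.tile(np.array([0.10, 0.0, 0.0]), (9, 1))
outlier = np.array([[0.60, 0.0, 0.0]])
measurements = np.vstack([inliers, outlier])

plain = measurements.mean(axis=0)    # pulled 5 cm toward the outlier
robust = robust_mean(measurements)   # stays within about 1 mm of the inliers
```

The outlier's pull on the robust estimate is capped at `delta`, whereas in the quadratic cost it grows with the size of the error. This bounded influence is what lets an optimizer tolerate occasional bad visual or tactile readings.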

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of our approach. We estimate the 3D pose of an object using factor graph optimization from visual pose, contact measurements, and proprioception. The visual pose measurement $\boldsymbol{\zeta}\in SE(3)$ is obtained by a deep learning-based approach (FoundationPose [wen2023_foundationpose]) and is used in a visual factor $f_{\mathit{vis}}$. Intersection constraints between the object and the hand in its current pose $\mathbf{q}$ are considered by a penetration factor $f_{\mathit{pen}}$. Binary contact measurements $\mathbf{y} \in \mathbb{B}^K$ are obtained per link from $K$ rectangular sensor pads and are used in a tactile factor $f_{\mathit{tac}}$ that also incorporates $\mathbf{q}$.
  • Figure 2: Tactile residuals measure the signed distance of contact points on each sensor pad to the object if the object does not penetrate the pad.
  • Figure 3: ADD-S statistics on the $D_{\mathit{vary}}$ dataset, comparing our visuo-tactile pose optimization against an ablation without tactile contacts and the vision baseline. Fusing vision and tactile information performs significantly better than both the ablation and the vision estimate (both $p<0.0001$).
  • Figure 4: ADD-S statistics at each time step on the $D_{\mathit{vary}}$ dataset, comparing our visuo-tactile pose optimization against the vision baseline. Fusing vision and tactile information significantly improves the accuracy of the pose estimate when the hand is rotated and the occlusion increases (after about 10 s), and it provides similar accuracy before the rotation.
  • Figure 5: ADD-S statistics on $D_{\mathit{vary}}$ during the first 4.5 s of the object grasps for five noise scales. Our approach of including tactile and penetration information improves accuracy significantly for all scales ($p<0.05$).
  • ...and 3 more figures
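
The figures above report accuracy as ADD-S. For reference, a minimal sketch of the standard ADD-S metric follows: each model point under the ground-truth pose is matched to the closest model point under the estimated pose, and the distances are averaged. The function name, the brute-force pairwise distance computation, and the cube example are assumptions for illustration. The authors' evaluation code may differ, for example in how model points are sampled.

```python
import numpy as np

def add_s(model_points, R_gt, t_gt, R_est, t_est):
    """Average closest-point distance between the model under two poses (ADD-S)."""
    gt = model_points @ R_gt.T + t_gt      # model in ground-truth pose, shape (N, 3)
    est = model_points @ R_est.T + t_est   # model in estimated pose, shape (N, 3)
    # Pairwise distances, shape (N, N); fine for a few thousand points.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

# Hypothetical example: the eight corners of a cube with 2 m side length.
corners = np.array([[x, y, z] for x in (-1.0, 1.0)
                              for y in (-1.0, 1.0)
                              for z in (-1.0, 1.0)])
I = np.eye(3)
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

shifted = add_s(corners, I, np.zeros(3), I, np.array([0.01, 0.0, 0.0]))  # 1 cm offset
symmetric = add_s(corners, I, np.zeros(3), Rz90, np.zeros(3))            # 90 degree turn
```

Because of the closest-point matching, a 90 degree rotation of the cube about a symmetry axis gives an error of zero, while a 1 cm translation gives an error of 1 cm. This is why ADD-S is commonly used for objects with symmetries.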