Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor

Trevor Ablett; Oliver Limoyo; Adam Sigal; Affan Jilani; Jonathan Kelly; Kaleem Siddiqi; Francois Hogan; Gregory Dudek

TL;DR

This work integrates a see-through visuotactile sensor (STS) into imitation learning to tackle contact-rich manipulation tasks involving relative motion. It introduces tactile force matching, which recreates the demonstrator's force profile, and learned mode switching, in which the policy itself switches the sensor between visual and tactile modes at the moment of contact. Evaluated on four door manipulation tasks, the approach yields substantial gains: approximately $64.2\%$ system improvement, $62.5\%$ from force matching, $30.3\%$ from mode switching, and $42.5\%$ from including STS data as policy input, demonstrating the value of see-through tactile sensing for both data collection and policy execution. The results highlight the practical impact of multimodal sensing for robust manipulation, with gains in task performance across observation configurations.

Abstract

Contact-rich tasks continue to present many challenges for robotic manipulation. In this work, we leverage a multimodal visuotactile sensor within the framework of imitation learning (IL) to perform contact-rich tasks that involve relative motion (e.g., slipping and sliding) between the end-effector and the manipulated object. We introduce two algorithmic contributions, tactile force matching and learned mode switching, as complementary methods for improving IL. Tactile force matching enhances kinesthetic teaching by reading approximate forces during the demonstration and generating an adapted robot trajectory that recreates the recorded forces. Learned mode switching uses IL to couple visual and tactile sensor modes with the learned motion policy, simplifying the transition from reaching to contacting. We perform robotic manipulation experiments on four door-opening tasks with a variety of observation and algorithm configurations to study the utility of multimodal visuotactile sensing and our proposed improvements. Our results show that the inclusion of force matching raises average policy success rates by 62.5%, visuotactile mode switching by 30.3%, and visuotactile data as a policy input by 42.5%, emphasizing the value of see-through tactile sensing for IL, both for data collection to allow force matching, and for policy execution to enable accurate task feedback. Project site: https://papers.starslab.ca/sts-il/

Paper Structure (27 sections, 12 equations, 22 figures)

Introduction
Related Work
Methodology
Markov Decision Processes and Imitation Learning
Impedance Control
Data Collection with Kinesthetic Teaching
Force Matching
Measuring Unscaled Forces with A Tactile Sensor
Tactile Force Matching via Calibration
STS Sensor Mode Labelling
Policy Training
Experiments
Environment and Task Parameters
Imitation Learning and Training Parameters
System Performance
...and 12 more sections

Figures (22)

Figure 1: Our STS sensor before and during contact (right column) with a cabinet knob (middle column) during a door-opening task (left column). In visual mode, the camera sees through the gel membrane, allowing the knob to be found, while tactile mode provides contact-based feedback, via gel deformation and resultant dot displacement, upon initial contact and during opening. Red circles highlight the knob in the sensor view.
Figure 2: Visual representations of each component of our system: (1) Raw human demonstrations are generated via kinesthetic teaching. (2) During the demonstration, an STS sensor in tactile mode allows us to read four dimensions of an unscaled wrench: in $x$, $y$, $z$, and rotationally about $z$. (3) For each timestep $t$ of the demonstration trajectory from (1), the raw demonstration pose $\mathbf{x}_{\text{raw},t}$ is combined with the linear calibration parameters $\mathbf{A}$ and $\mathbf{b}$ (relating the unscaled wrench $\tilde{\boldsymbol{\mathcal{F}}}$ from (2) to the control error $\mathbf{e}$) and the measured wrench $\tilde{\boldsymbol{\mathcal{F}}}_{\text{raw},t}$ from (2) to generate a force-matched replay pose $\mathbf{x}^d_{\text{rep},t}$. (4) The new, modified replay poses are used to replay the demonstration while a human provides an STS mode switch label. These replayed, force-matched demonstrations are stored in an expert dataset containing STS, wrist camera, and relative pose data as observations, as well as robot motion and STS mode labels as actions. (5) We train policies with behavior cloning using some or all of the STS, wrist camera, and relative pose data.
Figure 3: Various human approaches to opening a cabinet. The "Press" approach on the right requires far less arm rotation, but also generates relative motion between the knob and the hand, motivating the use of high-resolution tactile sensing to replicate.
Figure 4: With a human hand generating $\vec{F}_h$ and $\vec{\tau}_h$, wrenches measured at the wrist ($\vec{F}_w, \vec{\tau}_w$) via typical force-torque sensing modalities cannot isolate $\vec{F}_d$, as required for the force matching procedure outlined in the Force Matching section. This notation applies only to this figure.
Figure 5: Example raw images along with corresponding marker displacements, inferred depths [jilaniTactileRecoveryShape2024], and $e_z$ and $\tilde{\mathcal{F}}_z$ values, along with the piecewise linear relationship between $e_z$ and $\tilde{\mathcal{F}}_z$. See supplementary materials for corresponding video.
...and 17 more figures
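
The force-matching step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the calibration is a single linear map $\mathbf{e} = \mathbf{A}\tilde{\boldsymbol{\mathcal{F}}} + \mathbf{b}$ (the Figure 5 caption indicates the actual $e_z$ relationship is piecewise linear), that poses are represented as `[x, y, z, yaw]` to mirror the four measured wrench dimensions, and that the replay pose is obtained by adding the control error to the raw pose. The function names, the pose representation, and the sign convention are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def control_error(A: Matrix, b: Sequence[float], wrench: Sequence[float]) -> Vector:
    """Map an unscaled tactile wrench to a control error: e = A @ wrench + b.

    Assumption: one linear calibration (A, b) for all four dimensions.
    """
    return [
        sum(a_ij * f_j for a_ij, f_j in zip(row, wrench)) + b_i
        for row, b_i in zip(A, b)
    ]


def force_matched_poses(
    raw_poses: Sequence[Sequence[float]],
    raw_wrenches: Sequence[Sequence[float]],
    A: Matrix,
    b: Sequence[float],
) -> List[Vector]:
    """Offset each raw demonstration pose by the calibrated control error.

    Assumption: poses are [x, y, z, yaw], and the desired replay pose is
    x_rep = x_raw + e, so that an impedance controller tracking x_rep
    would push against the contact to reproduce the recorded force.
    """
    replay = []
    for x_raw, f_raw in zip(raw_poses, raw_wrenches):
        e = control_error(A, b, f_raw)
        replay.append([x_i + e_i for x_i, e_i in zip(x_raw, e)])
    return replay


# Hypothetical calibration: diagonal gain of 0.5 and a small offset in x.
A_example = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
b_example = [0.25, 0.0, 0.0, 0.0]

replay_example = force_matched_poses(
    raw_poses=[[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0]],
    raw_wrenches=[[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]],
    A=A_example,
    b=b_example,
)
```

With zero measured wrench, the replay pose differs from the raw pose only by the offset `b`, so free-space portions of a demonstration are replayed nearly unchanged under this sketch, while in-contact portions are shifted in proportion to the tactile reading.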
