Adding internal audio sensing to internal vision enables human-like in-hand fabric recognition with soft robotic fingertips

Iris Andrussow, Jans Solano, Benjamin A. Richardson, Georg Martius, Katherine J. Kuchenbecker

TL;DR

This work presents a robotic system that senses both spatio-temporal force patterns and texture-induced vibrations, investigates how each type of haptic information influences robotic tactile perception of fabrics, and achieves a maximum fabric classification accuracy of 97% on a dataset of 20 common fabrics.

Abstract

Distinguishing the feel of smooth silk from coarse cotton is a trivial everyday task for humans. When exploring such fabrics, fingertip skin senses both spatio-temporal force patterns and texture-induced vibrations that are integrated to form a haptic representation of the explored material. It is challenging to reproduce this rich, dynamic perceptual capability in robots because tactile sensors typically cannot achieve both high spatial resolution and high temporal sampling rate. In this work, we present a system that can sense both types of haptic information, and we investigate how each type influences robotic tactile perception of fabrics. Our robotic hand's middle finger and thumb each feature a soft tactile sensor: one is the open-source Minsight sensor that uses an internal camera to measure fingertip deformation and force at 50 Hz, and the other is our new sensor Minsound that captures vibrations through an internal MEMS microphone with a bandwidth from 50 Hz to 15 kHz. Inspired by the movements humans make to evaluate fabrics, our robot actively encloses and rubs folded fabric samples between its two sensitive fingers. Our results test the influence of each sensing modality on overall classification performance, showing high utility for the audio-based sensor. Our transformer-based method achieves a maximum fabric classification accuracy of 97 % on a dataset of 20 common fabrics. Incorporating an external microphone away from Minsound increases our method's robustness in loud ambient noise conditions. To show that this audio-visual tactile sensing approach generalizes beyond the training data, we learn general representations of fabric stretchiness, thickness, and roughness.

Paper Structure (21 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Setup: Recognizing common fabrics using multimodal input from vision- and audio-based fingertip sensors on a robot hand.
  • Figure 2: Multimodal data collection setup: A four-fingered robot hand is equipped with two tactile fingertips that measure different modalities of touch. The middle finger terminates with the vision-based tactile sensor Minsight, which delivers high-resolution internal images of the soft fingertip deformation. These images are used to infer a contact force map and to calculate optical flow over consecutive frames. The robot thumb is equipped with Minsound, our soft microphone-based sensor, which records broad-bandwidth audio at a sampling rate of 48 kHz. An identical microphone is mounted on the side of the robot's palm to record environment noise.
  • Figure 3: Data collection: During the entire exploratory procedure, we stream camera images from the vision-based tactile sensor at 50 Hz, audio data from the internal and external microphones at 48 kHz, and joint angles, currents and velocities for all six involved motors at 50 Hz (every 20 ms). For each image and the corresponding proprioceptive data, we record a window of microphone data corresponding to the most recent 2048 audio data points (42.67 ms) recorded before the image was captured.
  • Figure 4: Multimodal classification architecture: We process fabric interactions in sequences of $N=200$ time steps (4 s of data). Each modality is processed by an encoder head, and the respective features are concatenated and normalized. A position embedding is added to each overall feature vector, and the sequence of all resulting vectors is processed by a sequential backbone, consisting of three multi-head attention layers or a temporal convolutional network (TCN). The outputs of the backbone are further processed by a classification head that consists of two fully connected layers followed by a softmax function to yield the output.
  • Figure 5: Close-up images of all 20 fabrics used for training plus three holdout fabrics (21–23) to test generalization: Class 0 is only the robot fingers moving against each other without any fabric. The depicted side of each material is folded inward during data collection. The dataset corresponding to this paper provides verbal descriptions and property categories (stretchiness, thickness and roughness, visualized here by pictograms) of all fabric samples [fabricdataset].
  • ...and 3 more figures
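
The timing figures quoted in the captions of Figures 3 and 4 can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' data pipeline: the constant names and the `audio_window` helper are hypothetical, and it assumes that the audio stream and the 50 Hz frames share a clock that starts at sample 0, so that frame `k` is captured exactly `k * 960` audio samples into the recording.

```python
# Rates and sizes taken from the Figure 3 and Figure 4 captions.
AUDIO_RATE_HZ = 48_000    # microphone sampling rate
FRAME_RATE_HZ = 50        # camera and proprioception rate (one frame every 20 ms)
WINDOW_SAMPLES = 2048     # audio samples paired with each frame
SEQUENCE_STEPS = 200      # time steps per classified sequence (N in Figure 4)

# Derived quantities.
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // FRAME_RATE_HZ     # new audio samples between frames
WINDOW_MS = 1000.0 * WINDOW_SAMPLES / AUDIO_RATE_HZ    # duration of one audio window
OVERLAP_SAMPLES = WINDOW_SAMPLES - SAMPLES_PER_FRAME   # samples shared by consecutive windows
SEQUENCE_SECONDS = SEQUENCE_STEPS / FRAME_RATE_HZ      # duration of one input sequence


def audio_window(audio, frame_index):
    """Return the WINDOW_SAMPLES samples recorded just before a frame.

    Hypothetical helper: assumes frame `frame_index` is captured at audio
    sample index `frame_index * SAMPLES_PER_FRAME` (shared clock, no jitter).
    Returns None when fewer than WINDOW_SAMPLES samples precede the frame
    or when the recording ends before the frame.
    """
    end = frame_index * SAMPLES_PER_FRAME
    start = end - WINDOW_SAMPLES
    if start < 0 or end > len(audio):
        return None
    return audio[start:end]


if __name__ == "__main__":
    one_second = list(range(AUDIO_RATE_HZ))  # stand-in for real microphone data
    window = audio_window(one_second, 3)     # first frame with a full window behind it
    print(f"window: {WINDOW_MS:.2f} ms, overlap: {OVERLAP_SAMPLES} samples")
    print(f"sequence: {SEQUENCE_SECONDS} s, first window spans "
          f"samples {window[0]}..{window[-1]}")
```

Under these assumptions each 2048-sample window is longer than the 20 ms frame period, so consecutive windows share more than half of their samples, and a sequence of 200 frames covers the 4 s stated in the Figure 4 caption.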