An Investigation of Multi-feature Extraction and Super-resolution with Fast Microphone Arrays

Eric T. Chang; Runsheng Wang; Peter Ballentine; Jingxi Xu; Trey Smith; Brian Coltin; Ioannis Kymissis; Matei Ciocarlie

TL;DR

This work demonstrates that a sparse MEMS microphone array embedded under a PDMS layer can support multiple tactile tasks (texture classification, contact localization, and drag velocity estimation) using a transformer-based time-series analysis framework. By operating on short time windows of high-rate microphone data, the method achieves 77.3% texture classification accuracy (84.2% excluding the slowest velocity), 1.8 mm mean localization error, and 5.6 mm/s mean velocity error, and its texture and localization models remain robust at unseen velocities. The study also shows fast contact detection, with average response times in the low-millisecond range, highlighting the potential of MEMS microphone arrays as a low-cost, space-efficient tactile modality that can complement other sensing modalities. Overall, the findings inform sensor design by illustrating what tactile information can be extracted from a sparse microphone network and how data-driven time-series methods enable that extraction.

Abstract

In this work, we use MEMS microphones as vibration sensors to simultaneously classify texture and estimate contact position and velocity. Vibration sensors are an important facet of both human and robotic tactile sensing, providing fast detection of contact and onset of slip. Microphones are an attractive option for implementing vibration sensing as they offer a fast response and can be sampled quickly, are affordable, and occupy a very small footprint. Our prototype sensor uses only a sparse array (8-9 mm spacing) of distributed MEMS microphones (<$1, 3.76 x 2.95 x 1.10 mm) embedded under an elastomer. We use transformer-based architectures for data analysis, taking advantage of the microphones' high sampling rate to run our models on time-series data as opposed to individual snapshots. This approach allows us to obtain 77.3% average accuracy on 4-class texture classification (84.2% when excluding the slowest drag velocity), 1.8 mm mean error on contact localization, and 5.6 mm/s mean error on contact velocity. We show that the learned texture and localization models are robust to varying velocity and generalize to unseen velocities. We also report that our sensor provides fast contact detection, an important advantage of fast transducers. This investigation illustrates the capabilities one can achieve with a MEMS microphone array alone, leaving valuable sensor real estate available for integration with complementary tactile sensing modalities.
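The abstract contrasts running models on time-series data with running them on individual snapshots. A minimal sketch of that idea is to cut a multi-channel recording into fixed-length, overlapping windows that a sequence model can consume. The function name `sliding_windows`, the window length, the hop size, and the 3-channel toy recording are all illustrative assumptions; the paper's actual sampling rate, window size, and preprocessing are not stated in this summary.

```python
from typing import List, Sequence


def sliding_windows(
    samples: Sequence[Sequence[float]], window_len: int, hop: int
) -> List[List[List[float]]]:
    """Split a (time, channel) recording into overlapping fixed-length windows.

    A trailing fragment shorter than window_len is dropped.
    """
    if window_len <= 0 or hop <= 0:
        raise ValueError("window_len and hop must be positive")
    windows = []
    for start in range(0, len(samples) - window_len + 1, hop):
        # Copy each frame so windows do not alias the original recording.
        windows.append([list(frame) for frame in samples[start:start + window_len]])
    return windows


# Hypothetical toy recording: 10 time steps from 3 microphones.
recording = [[float(t), float(t) + 0.1, float(t) + 0.2] for t in range(10)]

# Assumed window length of 4 samples with a hop of 2 samples.
windows = sliding_windows(recording, window_len=4, hop=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window keeps the temporal ordering of the samples across all channels, which is the information a single snapshot discards.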

Paper Structure (18 sections, 4 figures, 2 tables, 1 algorithm)

This paper contains 18 sections, 4 figures, 2 tables, 1 algorithm.

Introduction
Related Work
Microphone Array Design and Fabrication
Electronics
Fabrication
Data-Driven Multi-Feature Extraction
Training Data
Preprocessing
Learning Model Architecture
Training
Contact Characterization Performance
Texture Classification
Localization
Direct Velocity Regression
Response Time Analysis
...and 3 more sections

Figures (4)

Figure 1: Photo of the microphone array without the textured tape (a), and an exploded view of the sensor fabrication process (b). The four microphones of the inner square are spaced $d_1 = 8$ mm apart. The outer ring of microphones is spaced $d_2 = 9$ mm away from the inner square.
Figure 2: Our tactile sensor on an F/T sensor (a) and the 4 textures used in data collection (b). The four textures ("a," "b," "c," "d") have bump spacings $x$ from 0 to 4.5 mm and are printed on 15 mm diameter indenters. The bump diameter is given by $x \sqrt{\frac{2}{\pi}}$ [pestell2022_tactip]. The PLA indenter shown in (a) was used to collect the response time data (see the Response Time Analysis section).
Figure 3: An illustration of our network architecture. We apply 1-D temporal convolution on raw input signals to create latent representations. We feed these representations to a transformer encoder, followed by self-attention pooling and a linear output head.
Figure 4: Confusion matrices for 4-class texture classification. Each matrix is for a different test set, and each test set corresponds to one held-out velocity.
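The Figure 3 caption describes a pipeline that ends in self-attention pooling followed by a linear output head. The sketch below illustrates only those last two stages in plain Python, assuming one common formulation of attention pooling: a score vector assigns each time step a scalar, softmax normalizes the scores, and the pooled feature is the weighted average over time. The names `attention_pool` and `linear_head`, the toy latent sequence, and all weight values are hypothetical; the paper's exact pooling formulation and trained parameters are not given here.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    latents: Sequence[Sequence[float]], score_weights: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse a (time, feature) sequence into one feature vector.

    Each time step is scored by a dot product with score_weights; the
    softmax of the scores gives per-step weights that sum to 1.
    """
    scores = [sum(w * h for w, h in zip(score_weights, step)) for step in latents]
    weights = softmax(scores)
    dim = len(latents[0])
    pooled = [sum(a * step[d] for a, step in zip(weights, latents)) for d in range(dim)]
    return pooled, weights


def linear_head(
    features: Sequence[float],
    weight_rows: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Map the pooled feature vector to one output per row of weight_rows."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight_rows, bias)]


# Hypothetical latent sequence: 3 time steps, 2 features each.
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Zero score weights give uniform attention, so pooling reduces to the mean.
pooled_uniform, w_uniform = attention_pool(latents, [0.0, 0.0])

# A large weight on feature 0 concentrates attention on steps 0 and 2.
pooled_focus, w_focus = attention_pool(latents, [10.0, 0.0])

# Illustrative 4-output head (for example, 4 texture classes).
head_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
logits = linear_head(pooled_uniform, head_w, [0.0, 0.0, 0.0, 0.0])
print(pooled_uniform, w_focus, logits)
```

Compared with plain averaging over time, the learned scores let the model emphasize the time steps that carry the most information before the output head is applied.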
